Boost logo

Boost :

Subject: Re: [boost] [Heaps] benchmarks
From: Tim Blechmann (tim_at_[hidden])
Date: 2011-06-09 13:41:30


> >> it
> >> would be great if the b_heap implementation performed no worse than the
> >> "naive" version.
> >
> > it cannot: it uses a different tree structure than binary heaps, causing
> > a higher tree. with d-ary heaps, the arity is a tradeoff between tree
> > height and number of comparisons (higher arity, lower tree height but
> > more comparisons). in a similar manner, the b-heap adds a tradeoff
> > between memory locality and tree height ...
> >
> >> Also I've looked at your own benchmark graphs and it seems that there
> >> is a fairly constant performance difference between the different heaps
> >> - I don't see b_heaps getting faster than the alternatives as the queue
> >> size increases.
> >
> > maybe because my workstation has 8mb of cache
>
> Tim, you're proposing this b-heap purely as a performance optimisation
> compared to the "traditional" heap, but we've not yet seen any cases
> where it is actually faster. If it can be faster we need to see some
> examples, and the documentation should qualify when it can be expected
> to be better. If it's not faster, it should be dropped.

just for the record:
core i7 920, x86_64, linux-3.0-rc2, hyperthreading/frequency scaling disabled,
12gb triple-channel ddr3 ram

stressed with:
stress -m 1 --vm-bytes 4G --io 16 -c 3
and two boinc processes running at nice 10

configured the problem size to 15000000
changed work_t from boost::function to int (24byte to 4byte)

comparison of b-heap and dary_heap (with arity 2) ... running all at nice 1 to
increase the number of context switches.

 Performance counter stats for './bench_boostbheap':

     532245.601009 task-clock # 0.865 CPUs utilized
           150,066 context-switches # 0.000 M/sec
             3,497 CPU-migrations # 0.000 M/sec
         1,333,350 page-faults # 0.003 M/sec
 1,421,230,219,856 cycles # 2.670 GHz
   928,444,569,746 stalled-cycles-frontend # 65.33% frontend cycles idle
   643,217,557,503 stalled-cycles-backend # 45.26% backend cycles idle
 1,517,920,671,665 instructions # 1.07 insns per cycle
                                             # 0.61 stalled cycles per insn
   328,233,864,211 branches # 616.696 M/sec
     2,800,247,538 branch-misses # 0.85% of all branches

     615.223322417 seconds time elapsed

 Performance counter stats for './bench_boostdary':

     575427.533478 task-clock # 0.873 CPUs utilized
           160,863 context-switches # 0.000 M/sec
             3,841 CPU-migrations # 0.000 M/sec
         1,333,346 page-faults # 0.002 M/sec
 1,536,548,826,172 cycles # 2.670 GHz
 1,304,937,894,945 stalled-cycles-frontend # 84.93% frontend cycles idle
   999,244,398,224 stalled-cycles-backend # 65.03% backend cycles idle
   747,525,733,877 instructions # 0.49 insns per cycle
                                             # 1.75 stalled cycles per insn
   145,188,482,482 branches # 252.314 M/sec
     1,661,903,337 branch-misses # 1.14% of all branches

     659.415916004 seconds time elapsed

in a second run, i am comparing the cache misses:

 Performance counter stats for './bench_boostbheap':

    11,989,045,556 L1-dcache-load-misses [33.34%]
        96,664,490 L1-dcache-store-misses [50.00%]
     8,082,318,794 L1-dcache-prefetch-misses [33.32%]
     5,457,166,960 LLC-load-misses [49.99%]
        86,358,587 LLC-store-misses [16.67%]
        72,903,925 LLC-prefetch-misses [16.66%]

     602.784184115 seconds time elapsed

 Performance counter stats for './bench_boostdary':

    14,139,852,204 L1-dcache-load-misses [33.33%]
        96,895,924 L1-dcache-store-misses [50.00%]
       752,673,940 L1-dcache-prefetch-misses [33.33%]
     6,081,358,271 LLC-load-misses [49.99%]
        85,684,373 LLC-store-misses [16.67%]
       106,694,619 LLC-prefetch-misses [16.67%]

     652.591757245 seconds time elapsed

... now i tweaked the benchmark:
* using 10 heaps instead of one but only run 5 times instead of 50 - increases
  cache thrashing
* remove work so that the heaps only contains floats (4byte) - more objects per
  cachline
* 4 processes on the idle quad-core machine (2 for default stats, 4 for cache
  misses events) - increases cache thrashing

results:
 Performance counter stats for './bench_boostbheap':

     651624.952248 task-clock # 0.780 CPUs utilized
           138,831 context-switches # 0.000 M/sec
             2,543 CPU-migrations # 0.000 M/sec
           674,172 page-faults # 0.001 M/sec
 1,740,597,175,799 cycles # 2.671 GHz
 1,292,857,574,357 stalled-cycles-frontend # 74.28% frontend cycles idle
   831,566,179,360 stalled-cycles-backend # 47.77% backend cycles idle
 1,441,574,215,357 instructions # 0.83 insns per cycle
                                             # 0.90 stalled cycles per insn
   328,096,551,218 branches # 503.505 M/sec
     1,152,407,287 branch-misses # 0.35% of all branches

     834.898435350 seconds time elapsed

Performance counter stats for './bench_boostdary':

     759395.182963 task-clock # 0.815 CPUs utilized
           156,451 context-switches # 0.000 M/sec
             2,781 CPU-migrations # 0.000 M/sec
           674,170 page-faults # 0.001 M/sec
 2,028,502,096,477 cycles # 2.671 GHz
 1,806,658,397,323 stalled-cycles-frontend # 89.06% frontend cycles idle
   988,635,679,553 stalled-cycles-backend # 48.74% backend cycles idle
   698,231,661,607 instructions # 0.34 insns per cycle
                                             # 2.59 stalled cycles per insn
   145,877,024,263 branches # 192.096 M/sec
       907,579,344 branch-misses # 0.62% of all branches

     931.525434663 seconds time elapsed

 Performance counter stats for './bench_boostbheap':

    23,480,288,668 L1-dcache-load-misses [33.34%]
       264,061,426 L1-dcache-store-misses [50.00%]
    14,720,355,800 L1-dcache-prefetch-misses [33.32%]
     7,914,697,865 LLC-load-misses [49.99%]
        46,737,807 LLC-store-misses [16.67%]
        25,875,481 LLC-prefetch-misses [16.67%]

     832.307436282 seconds time elapsed

 Performance counter stats for './bench_boostdary':

    27,020,681,775 L1-dcache-load-misses [33.33%]
       326,151,939 L1-dcache-store-misses [50.00%]
     3,795,316,756 L1-dcache-prefetch-misses [33.33%]
    10,296,069,595 LLC-load-misses [50.00%]
        45,933,195 LLC-store-misses [16.67%]
        43,923,748 LLC-prefetch-misses [16.67%]

     949.094734119 seconds time elapsed

:)

tim




Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk