|
Boost : |
Subject: Re: [boost] [Heaps] benchmarks
From: Tim Blechmann (tim_at_[hidden])
Date: 2011-06-09 13:41:30
> >> it
> >> would be great if the b_heap implementation performed no worse than the
> >> "naive" version.
> >
> > it cannot: it uses a different tree structure than binary heaps, causing
> > a higher tree. with d-ary heaps, the arity is a tradeoff between tree
> > height and number of comparisons (higher arity, lower tree height but
> > more comparisons). in a similar manner, the b-heap adds a tradeoff
> > between memory locality and tree height ...
> >
> >> Also I've looked at your own benchmark graphs and it seems that there
> >> is a fairly constant performance difference between the different heaps
> >> - I don't see b_heaps getting faster than the alternatives as the queue
> >> size increases.
> >
> > maybe because my workstation has 8mb of cache
>
> Tim, you're proposing this b-heap purely as a performance optimisation
> compared to the "traditional" heap, but we've not yet seen any cases
> where it is actually faster. If it can be faster we need to see some
> examples, and the documentation should qualify when it can be expected
> to be better. If it's not faster, it should be dropped.
just for the record:
core i7 920, x86_64, linux-3.0-rc2, hyperthreading/frequency scaling disabled,
12gb triple-channel ddr3 ram
stressed with:
stress -m 1 --vm-bytes 4G --io 16 -c 3
and two boinc processes running at nice 10
configured the problem size to 15000000
changed work_t from boost::function to int (24byte to 4byte)
comparison of b-heap and dary_heap (with arity 2) ... running all at nice 1 to
increase the number of context switches.
Performance counter stats for './bench_boostbheap':
532245.601009 task-clock # 0.865 CPUs utilized
150,066 context-switches # 0.000 M/sec
3,497 CPU-migrations # 0.000 M/sec
1,333,350 page-faults # 0.003 M/sec
1,421,230,219,856 cycles # 2.670 GHz
928,444,569,746 stalled-cycles-frontend # 65.33% frontend cycles idle
643,217,557,503 stalled-cycles-backend # 45.26% backend cycles idle
1,517,920,671,665 instructions # 1.07 insns per cycle
# 0.61 stalled cycles per insn
328,233,864,211 branches # 616.696 M/sec
2,800,247,538 branch-misses # 0.85% of all branches
615.223322417 seconds time elapsed
Performance counter stats for './bench_boostdary':
575427.533478 task-clock # 0.873 CPUs utilized
160,863 context-switches # 0.000 M/sec
3,841 CPU-migrations # 0.000 M/sec
1,333,346 page-faults # 0.002 M/sec
1,536,548,826,172 cycles # 2.670 GHz
1,304,937,894,945 stalled-cycles-frontend # 84.93% frontend cycles idle
999,244,398,224 stalled-cycles-backend # 65.03% backend cycles idle
747,525,733,877 instructions # 0.49 insns per cycle
# 1.75 stalled cycles per insn
145,188,482,482 branches # 252.314 M/sec
1,661,903,337 branch-misses # 1.14% of all branches
659.415916004 seconds time elapsed
in a second run, i am comparing the cache misses:
Performance counter stats for './bench_boostbheap':
11,989,045,556 L1-dcache-load-misses [33.34%]
96,664,490 L1-dcache-store-misses [50.00%]
8,082,318,794 L1-dcache-prefetch-misses [33.32%]
5,457,166,960 LLC-load-misses [49.99%]
86,358,587 LLC-store-misses [16.67%]
72,903,925 LLC-prefetch-misses [16.66%]
602.784184115 seconds time elapsed
Performance counter stats for './bench_boostdary':
14,139,852,204 L1-dcache-load-misses [33.33%]
96,895,924 L1-dcache-store-misses [50.00%]
752,673,940 L1-dcache-prefetch-misses [33.33%]
6,081,358,271 LLC-load-misses [49.99%]
85,684,373 LLC-store-misses [16.67%]
106,694,619 LLC-prefetch-misses [16.67%]
652.591757245 seconds time elapsed
... now i tweaked the benchmark:
* using 10 heaps instead of one but only run 5 times instead of 50 - increases
cache thrashing
* remove work so that the heaps only contains floats (4byte) - more objects per
cachline
* 4 processes on the idle quad-core machine (2 for default stats, 4 for cache
misses events) - increases cache thrashing
results:
Performance counter stats for './bench_boostbheap':
651624.952248 task-clock # 0.780 CPUs utilized
138,831 context-switches # 0.000 M/sec
2,543 CPU-migrations # 0.000 M/sec
674,172 page-faults # 0.001 M/sec
1,740,597,175,799 cycles # 2.671 GHz
1,292,857,574,357 stalled-cycles-frontend # 74.28% frontend cycles idle
831,566,179,360 stalled-cycles-backend # 47.77% backend cycles idle
1,441,574,215,357 instructions # 0.83 insns per cycle
# 0.90 stalled cycles per insn
328,096,551,218 branches # 503.505 M/sec
1,152,407,287 branch-misses # 0.35% of all branches
834.898435350 seconds time elapsed
Performance counter stats for './bench_boostdary':
759395.182963 task-clock # 0.815 CPUs utilized
156,451 context-switches # 0.000 M/sec
2,781 CPU-migrations # 0.000 M/sec
674,170 page-faults # 0.001 M/sec
2,028,502,096,477 cycles # 2.671 GHz
1,806,658,397,323 stalled-cycles-frontend # 89.06% frontend cycles idle
988,635,679,553 stalled-cycles-backend # 48.74% backend cycles idle
698,231,661,607 instructions # 0.34 insns per cycle
# 2.59 stalled cycles per insn
145,877,024,263 branches # 192.096 M/sec
907,579,344 branch-misses # 0.62% of all branches
931.525434663 seconds time elapsed
Performance counter stats for './bench_boostbheap':
23,480,288,668 L1-dcache-load-misses [33.34%]
264,061,426 L1-dcache-store-misses [50.00%]
14,720,355,800 L1-dcache-prefetch-misses [33.32%]
7,914,697,865 LLC-load-misses [49.99%]
46,737,807 LLC-store-misses [16.67%]
25,875,481 LLC-prefetch-misses [16.67%]
832.307436282 seconds time elapsed
Performance counter stats for './bench_boostdary':
27,020,681,775 L1-dcache-load-misses [33.33%]
326,151,939 L1-dcache-store-misses [50.00%]
3,795,316,756 L1-dcache-prefetch-misses [33.33%]
10,296,069,595 LLC-load-misses [50.00%]
45,933,195 LLC-store-misses [16.67%]
43,923,748 LLC-prefetch-misses [16.67%]
949.094734119 seconds time elapsed
:)
tim
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk