

Subject: Re: [Boost-users] [iostreams] Devices and WOULD_BLOCK
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2015-01-30 06:47:52


On 27 Jan 2015 at 8:16, Brian Budge wrote:

> I got distracted by the ~1us estimate you gave here. I just wrote a
> quick benchmark for an uncontended fetch_add + compare and repeat, and
> came up with about 22 cycles total per iteration, which is about 7 ns
> per iteration. If I use a volatile int instead of an atomic, it is
> just over 2 ns per iteration. It's more expensive, but it seems to
> be less than an order of magnitude, rather than the 3 orders of
> magnitude mentioned above. Here's the code for posterity.

You may find the results at
https://ci.nedprod.com/view/Boost%20Thread-Expected-Permit/job/Boost.Spinlock%20Test%20Linux%20GCC%204.8/228/console
of interest. Some figures for Haswell uncontended:

=== Binary spinlock performance ===
1. Achieved 102531608.982708 transactions per second
2. Achieved 102767464.092175 transactions per second
3. Achieved 102837332.390959 transactions per second

=== Tristate spinlock performance ===
1. Achieved 97338235.338779 transactions per second
2. Achieved 99064689.648506 transactions per second
3. Achieved 99594486.110628 transactions per second

=== Pointer spinlock performance ===
1. Achieved 85935625.193489 transactions per second
2. Achieved 85551904.532972 transactions per second
3. Achieved 85074926.977000 transactions per second

Haswell contended:

=== Binary spinlock performance ===
1. Achieved 100056328.085670 transactions per second
2. Achieved 99038604.412362 transactions per second
3. Achieved 93814414.369464 transactions per second

=== Tristate spinlock performance ===
1. Achieved 73303113.800913 transactions per second
2. Achieved 87909718.117258 transactions per second
3. Achieved 66449661.784678 transactions per second

=== Pointer spinlock performance ===
1. Achieved 75884031.753741 transactions per second
2. Achieved 80199058.554479 transactions per second
3. Achieved 78455657.805638 transactions per second

One can draw from this that atomics are fast even when contended if
and only if the coherency traffic from cache line invalidations is
kept below the CPU's coherency bus bandwidth. An uncontended
unordered_map:

=== Large unordered_map spinlock write performance ===
1. Achieved 18456343.370007 transactions per second
2. Achieved 18493407.792700 transactions per second
3. Achieved 18589912.112064 transactions per second

Versus a contended one:

=== Large unordered_map spinlock write performance ===
1. Achieved 17174649.408718 transactions per second
2. Achieved 17177112.056269 transactions per second
3. Achieved 17468407.872320 transactions per second
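
For readers who want to try something similar, here is a minimal sketch
of a binary spinlock guarding writes to an unordered_map. It is not the
Boost.Spinlock implementation used for the figures above: the names
binary_spinlock and contended_insert are hypothetical, and the test
harness that produced the numbers is not reproduced here.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical minimal binary spinlock (an assumption for illustration,
// NOT the Boost.Spinlock implementation).
class binary_spinlock {
    std::atomic<bool> locked_{false};
public:
    void lock() noexcept {
        for (;;) {
            // Try to take the lock; acquire pairs with the release in unlock().
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
            // Wait on a plain load so spinning threads do not keep
            // invalidating the cache line with failed writes.
            while (locked_.load(std::memory_order_relaxed)) { }
        }
    }
    void unlock() noexcept { locked_.store(false, std::memory_order_release); }
};

// Each of `threads` workers inserts `per_thread` distinct keys into one
// shared map, taking the spinlock around every write.
// Returns the final number of elements in the map.
std::size_t contended_insert(unsigned threads, std::size_t per_thread) {
    binary_spinlock lock;
    std::unordered_map<std::size_t, std::size_t> map;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = 0; i < per_thread; ++i) {
                lock.lock();
                map[t * per_thread + i] = i;
                lock.unlock();
            }
        });
    }
    for (auto &w : workers) w.join();
    return map.size();
}
```

Calling contended_insert(1, n) gives the uncontended case and
contended_insert(4, n) a contended one; timing either call is left to
the reader.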

The performance of atomics isn't the problem if you're keeping cache
line invalidations low: modern CPUs are very good at that. The big
performance problem *introduced* by the use of atomics is that it
amounts to telling the compiler that global state is changing in a
way the compiler does not understand. This means that the compiler
cannot eliminate any code with a dependency touching any atomic.
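
A small sketch of the effect described above, assuming a typical
optimising build (e.g. -O2 on GCC or Clang). The function names
sum_plain and sum_atomic are made up for illustration. Both return the
same value, but the generated code usually differs a great deal, so
compare the assembly on your own compiler rather than take this on
trust.

```cpp
#include <atomic>
#include <cstdint>

// Plain counter: the optimiser sees the whole loop and can typically
// collapse it to "return n" with no loop at all.
std::uint64_t sum_plain(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < n; ++i) total += 1;
    return total;
}

// Atomic counter: even with relaxed ordering on a local variable,
// mainstream compilers typically emit one real read-modify-write per
// iteration and keep the loop.
std::uint64_t sum_atomic(std::uint64_t n) {
    std::atomic<std::uint64_t> total{0};
    for (std::uint64_t i = 0; i < n; ++i)
        total.fetch_add(1, std::memory_order_relaxed);
    return total.load(std::memory_order_relaxed);
}
```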

When I was testing ideas for non-allocating future designs a few
months ago, even a non-atomic boolean which, when flipped true,
"turned on" the use of atomics over non-atomics made the difference
between thousands of opcodes being generated and fewer than five
opcodes. Atomics are so fundamentally anti-optimiser that their use
ought to be avoided in header-only code as much as possible; they are
incredibly penalising. In non-header-only code I wouldn't worry: even
with link time optimisation present, compilers do a poor job of
optimising past ABI boundaries.
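
One way header-only code can keep atomics out of paths that do not
need them is to select the counter type at compile time. This is a
different technique from the runtime boolean described above and is
not taken from the future designs under test; ref_count and count_to
are hypothetical names used only for this sketch.

```cpp
#include <atomic>
#include <type_traits>

// Hypothetical counter whose storage is atomic only when asked for.
// With ThreadSafe == false no atomic appears anywhere in the
// instantiated code, so the optimiser is free to fold it away.
template <bool ThreadSafe>
class ref_count {
    typename std::conditional<ThreadSafe, std::atomic<long>, long>::type
        count_{0};
public:
    long increment() { return ++count_; }   // returns the new value
    long value() const { return count_; }
};

// Increment a fresh counter n times and return its final value.
template <bool ThreadSafe>
long count_to(long n) {
    ref_count<ThreadSafe> c;
    for (long i = 0; i < n; ++i) c.increment();
    return c.value();
}
```

count_to<false>(n) and count_to<true>(n) return the same value; the
interesting part is how different the generated code for the two
instantiations tends to be.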

Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/ 
http://ie.linkedin.com/in/nialldouglas/


