#define BOOST_SMT_PAUSE __asm__ __volatile__( "rep; nop" : : : "memory" );
 
We should probably use:
        __asm__ __volatile__( "pause" : : : "memory" );
 
Why? Because one thread polling (or worse, CASing) memory causes multiple overlapping memory operations to be issued:
 
"On a processor with a super-scalar speculative execution engine, a fast spin-wait loop results in the issue of multiple read requests by the waiting thread as it rapidly goes through the loop. These requests potentially execute out-of-order. When the processor detects a write by one thread to any read of the same data that is in progress from another thread, the processor must guarantee that no violations of memory order occur. To ensure the proper order of outstanding memory operations, the processor incurs a severe penalty. The penalty from memory order violations can be reduced significantly by inserting a PAUSE instruction in the loop. This eliminates multiple loop iterations in the pipeline."
 
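The effect can be especially bad with hyper-threading, where the spinning thread shares execution resources with the other logical processor on the same core.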
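
As a rough sketch of what the quoted advice looks like in a spin-wait loop, assuming GCC-style inline assembly on x86 and C++11 atomics (the names cpu_pause, spin_until_set and demo_spin_wait are made up for illustration and are not part of Boost):

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper, not a Boost API: emit a spin-wait hint where one is known.
inline void cpu_pause() noexcept {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    // The "memory" clobber stops the compiler from caching loads across the hint.
    __asm__ __volatile__("pause" : : : "memory");
#else
    std::this_thread::yield();  // assumed fallback for other targets
#endif
}

// Spin until another thread sets `ready`, issuing one PAUSE per iteration.
inline void spin_until_set(const std::atomic<bool>& ready) noexcept {
    while (!ready.load(std::memory_order_acquire)) {
        cpu_pause();
    }
}

// Small demonstration: a producer publishes a value, the caller spins for it.
inline int demo_spin_wait() {
    std::atomic<bool> ready{false};
    int payload = 0;
    std::thread producer([&] {
        payload = 42;
        ready.store(true, std::memory_order_release);
    });
    spin_until_set(ready);
    int seen = payload;  // visible because of the release/acquire pair
    producer.join();
    return seen;
}
```

The pause sits inside the loop body, so each pass through the loop issues one read and then one hint, rather than a burst of back-to-back reads.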
 
http://software.intel.com/en-us/articles/long-duration-spin-wait-loops-on-hyper-threading-technology-enabled-intel-processors/
See also the PDF "Using Spin-Loops on Intel® Pentium® 4 Processor and Intel® Xeon™ Processor":
http://software.intel.com/file/25602
 
An Intel forum posting says "rep; nop" is faster, but when you're spinning, 40 clock cycles is fast enough:
http://software.intel.com/en-us/forums/showthread.php?t=48371
 
 
It may also make sense, when a CAS fails, to poll by reading instead of issuing more CASes.  Attempting a CAS invalidates the cache line in the other CPUs' caches.  If several threads are trying to lock the same spin mutex, this can cause a lot of unnecessary noise on the bus.  If they poll with plain reads instead, each CPU can keep its own cached copy of the cache line, which is invalidated when the mutex is released.
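
The read-then-CAS idea could be sketched roughly as follows. This is only an illustration using C++11 std::atomic, not the existing Boost spinlock code; ttas_spinlock, relax_cpu and demo_ttas_counter are hypothetical names, and the x86 pause path assumes a GCC-compatible compiler:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper: spin-wait hint on x86 with GCC-style asm, yield elsewhere.
inline void relax_cpu() noexcept {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__("pause" : : : "memory");
#else
    std::this_thread::yield();  // assumed fallback for other targets
#endif
}

// Hypothetical spin mutex that polls with plain reads after a failed CAS.
class ttas_spinlock {
    std::atomic<bool> locked_{false};

public:
    void lock() noexcept {
        for (;;) {
            bool expected = false;
            // One CAS attempt; this needs exclusive ownership of the cache line.
            if (locked_.compare_exchange_strong(expected, true,
                                                std::memory_order_acquire,
                                                std::memory_order_relaxed)) {
                return;
            }
            // CAS failed: wait on a read-only copy of the line until the
            // lock looks free, then go back and try the CAS again.
            while (locked_.load(std::memory_order_relaxed)) {
                relax_cpu();
            }
        }
    }

    void unlock() noexcept {
        locked_.store(false, std::memory_order_release);
    }
};

// Small demonstration: four threads increment a plain counter under the lock.
inline long demo_ttas_counter() {
    ttas_spinlock mutex;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < 10000; ++i) {
                mutex.lock();
                ++counter;
                mutex.unlock();
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter;
}
```

While the lock is held, waiters only execute loads in the inner loop, so the line can stay shared in each waiter's cache; the CAS is retried only after a load has seen the lock free.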
 
The points in the Intel paper are probably sound on other SMP platforms as well.
 
Chris Hite