Boost logo

Boost :

From: Howard Hinnant (hinnant_at_[hidden])
Date: 2004-01-28 20:09:57


I've been reviewing this thread, and synchronization primitives with
the PPC instruction set in general. I've looked especially at
Alexander's comments. And I've done stress testing and performance
testing on Mac OS X, both CFM and Mach-O platforms.

After reviewing an old PowerPC 601 User's Manual which has an appendix
on synchronization examples, which does take into account
multi-processor architectures, I've made slight modifications to
Martin's/Howard's posted code, namely the addition of isync to the
constructor, and sync to the destructor:

inline
lightweight_mutex::scoped_lock::scoped_lock(lightweight_mutex & m):
m_(m)
{
     register volatile int* p = &m_.a_;
     register int f;
     register int one = 1;
     asm
     {
     loop:
         lwarx f, 0, p
         cmpwi f, 0
         bne- yield
         stwcx. one, 0, p
         beq+ done
     }
     yield:
     yield_thread();
     goto loop;
     asm
     {
     done:
         isync
     }
}

inline
lightweight_mutex::scoped_lock::~scoped_lock()
{
     asm
     {
         sync
     }
     m_.a_ = 0;
}

The 601 UM has example code that looks very much like this.

To test this code I created a volatile global int that a test function
would both increment and decrement, testing values before and after
every change, and of course locking a mutex so that only one thread
could enter the test function at a time. 10 threads are set up to loop
over the test function a million times. Optimizations are full on, so
as to collect timing information as well.

I used Metrowerks::threads to create threads, and to provide
OS-supported mutexes for comparison. Metrowerks::threads is very
similar to boost::threads. On the Mach-O platform Metrowerks::threads
is a thin inlined wrapper over the native pthread library (BSD).

volatile int global = 0;

#ifdef NDEBUG
#undef NDEBUG
#endif

#include <cassert>

#include <msl_thread>

#ifdef LWM
lightweight_mutex mut;
#else
Metrowerks::mutex mut;
#endif

void test()
{
#ifdef LWM
     lightweight_mutex::scoped_lock lock(mut);
#else
     Metrowerks::mutex::scoped_lock lock(mut);
#endif
     assert(global == 0);
     ++global;
     assert(global == 1);
     --global;
     assert(global == 0);
}

void run_test()
{
     for (unsigned i = 0; i < 0x000FFFFF; ++i)
         test();
}

#include <iostream>
#include <time.h>

int main()
{
     const int num_thread = 10;
     clock_t t = clock();
#ifndef UNCONTESTED
     Metrowerks::thread_group g;
     for (int i = 0; i < num_thread; ++i)
         g.create_thread(run_test);
     g.join_all();
#else
     for (int i = 0; i < num_thread; ++i)
         run_test();
#endif
     t = clock() - t;
     std::cout << "Done in " << t/(double)CLOCKS_PER_SEC << " seconds\n";
}

The test can be run in "contested" or "uncontested" mode. The
contested mode creates 10 threads to run run_test() simultaneously.
The uncontested mode simply calls run_test() 10 times. In the
uncontested mode the mutexes are still locked and unlocked the same
number of times, but the lock is guaranteed to not be held by another
thread.

Also as a sanity check I created a "naive" scoped_lock:

inline
lightweight_mutex::scoped_lock::scoped_lock(lightweight_mutex & m):
m_(m)
{
     while (m_.a_)
         yield_thread();
     m_.a_ = 1;
}

inline
lightweight_mutex::scoped_lock::~scoped_lock()
{
     m_.a_ = 0;
}

The purpose of this version was to insure that the test is actually
testing what I was intending. Sure enough, this variant consistently
asserted on the Mach-O platform and hung or crashed on the CFM platform
when run in contested mode.

All testing was done on a dual 500 Mhz G4 running Mac OS X 10.3. An
activity monitor was checked periodically during the tests and showed
that both processors were being utilized to run the test.

Performance Data:

OS: Used MP tasks on CFM, and pthreads on Mach-O
LWM: lightweight_mutex
naive: see "naive" description above

All times in seconds, lower is faster/better.

+--------------------+-------+--------+---------+
| | OS | LWM | naive |
+--------------------+-------+--------+---------+
| CFM contested | 248 | 1.86 | crash |
+--------------------+-------+--------+---------+
| CFM uncontested | 28.5 | 1.79 | 0.994 |
+--------------------+-------+--------+---------+
| Mach-O contested | 202 | 3.03 | assert |
+--------------------+-------+--------+---------+
| Mach-O uncontested | 3.26 | 1.69 | 0.800 |
+--------------------+-------+--------+---------+

 From the timings there appears to be sufficient motivation to explore
the lightweight_mutex on these platforms.

 From a correctness standpoint I can not find any faults of the lwm
through testing. I even bumped the number of threads up to 100, and
the number of test cycles up to 256 million for a contested LWM run
(Mach-O only). The run took 146 minutes on my 2 x 500 Mhz machine and
indicated absolutely zero problems (I did not have the patience to make
such a run with the OS-based synchronization). I monitored the test to
insure that all 100 worker threads were active for the great majority
of this time. I also noted that the test consumed approximately 1.85
processors (on average, up to 1.90 processors when I was reading the
paper, down to 1.15 processors when I was surfing the web).

Alexander, I have no doubt that you will see problems with this
implementation, and I appreciate your comments. Could you please
comment specifically on how the lwarx, stwcx., isync and sync
instructions do not address your concerns, thanks. Or could you
perhaps suggest a test I could realistically run to expose problems.

-Howard


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk