From: Howard Hinnant (hinnant_at_[hidden])
Date: 2006-10-26 13:51:22
On Oct 26, 2006, at 12:39 PM, Peter Dimov wrote:
> Howard Hinnant wrote:
>> Here is my implementation of Terekhov's *fair* read/write algorithm,
>> with upgradable thrown in. This algorithm is just as you say: The
>> OS alone decides who gets to run next.
> There are two interesting decisions in a read/write mutex. The
> first is what
> happens when a writer is blocked in wrlock() waiting for readers
> and another
> reader tries to rdlock(). POSIX leaves this implementation defined
> in the
> general case; however, if priority scheduling is supported, the reader
> "shall" obtain a lock if and only if it's of a higher priority
> compared to
> the waiting writer (or all writers, if there's more than one).
> If we assume equal priority for simplicity, this means that rdlock
> must block. This is detectable by a conformance test, by the way,
> since the test fails if and only if rdlock blocks.
> This is the "gate 1" part, and all of the algorithms I've been
> experimenting with also have this property. If the 'writer pending'
> bit is set, no readers are allowed to obtain a lock. However if
> another, higher-priority writer arrives after the pending bit is
> already set, I still grant write access to the first writer, which
> is probably a violation of POSIX [TPS]. :-)
> The other interesting decision is what happens on unlock (two
> subparts, last reader unlocks, or writer unlocks.) Your
> implementation seems to use the "thundering herd" approach of
> waking all. This is fair, but may be wasteful if you wake up
> several writers, or a writer and several readers, and a writer
> manages to pass through gate 1 (everyone else must block again).
> (This redundant wakeup may not be visible in a benchmark, but it
> can steal CPU from other busy non-contending threads in a real
> application.)
> (I see a significant amount of CAS failures even when waking up
> only readers; again, this doesn't affect the timing of the specific
> benchmark, and I'm not sure that it's actually a problem.)
> Another approach is to wake up one writer in rdunlock and all
> readers in wrunlock (alternating between readers and writers). I
> also have an option in mind that I haven't tested yet.
Very nice summary and astute observations.
I must admit to really liking the "thundering herd" approach for the
interesting unlock cases you cite. This is where you're really
turning things over to the OS and saying "you decide". And I just
flat haven't experimented with priority and consider myself currently
unqualified to comment in that area. Which I guess is another reason
for me to prefer to let the OS decide such matters (it presumably
knows and picks threads with higher priority to wake).
Last reader unlocks:
In this case if there is a writer waiting (at gate 2), then (imho)
there's no decision but to give the writer the go-ahead, and this is
assuming equal priority as you stated.
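A minimal sketch of that last-reader-unlock rule, assuming a single state word whose low bits count readers and whose high bit marks a writer parked at gate 2 (the class and member names here are hypothetical, not the posted implementation):

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical sketch of the read-side paths. Gate 1 is where all
// threads queue; gate 2 is where the single admitted writer waits
// for the current readers to drain.
class rw_mutex_read_side_sketch
{
    std::mutex              mut_;
    std::condition_variable gate1_;   // readers (and writers) queue here
    std::condition_variable gate2_;   // the admitted writer waits here
    unsigned                state_ = 0;
    static const unsigned write_entered_ = 1u << 31;  // gate 1 closed
    static const unsigned n_readers_     = ~write_entered_;
public:
    void lock_shared()
    {
        std::unique_lock<std::mutex> lk(mut_);
        while (state_ & write_entered_)       // gate 1 closed to readers
            gate1_.wait(lk);
        ++state_;                             // one more reader inside
    }
    void unlock_shared()
    {
        std::lock_guard<std::mutex> lk(mut_);
        unsigned readers = (state_ & n_readers_) - 1;
        state_ = (state_ & write_entered_) | readers;
        if ((state_ & write_entered_) && readers == 0)
            gate2_.notify_one();  // last reader out: writer gets the go
        // no writer at gate 2: the last reader does absolutely nothing
    }
    unsigned readers() const { return state_ & n_readers_; }  // test hook
};
```

Note the two branches match the text: a single notify_one when a writer is parked at gate 2, and no notification at all otherwise.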
If there is not a writer waiting at gate 2, and this is the last
reader, then my implementation does absolutely nothing. A single
reader inside of gate 1 blocks no one else from passing through gate
1. There is no thundering herd for last reader unlock.
Writer unlocks:
By definition no one is waiting at gate 2. There may be waiters,
both read and write (and upgradable if you're using that) at gate 1.
If you choose at this point, a reader or a writer, you are, by
definition, giving one of the two priority, at least for this next
round. One might have the write unlock alternate its choice, but I'm
not seeing much performance gain here. You either wake up 1 writer,
or you wake up many readers (on the reader path it might actually be
significantly more expensive to make num_reader calls to notify_one()
than just do a notify_all()).
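Sketched in the same spirit (hypothetical names, not the posted code), the writer-unlock path is exactly one notify_all on gate 1, rather than num_reader calls to notify_one():

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical sketch of the "thundering herd" writer unlock: wake
// everyone at gate 1 with a single notify_all and let the OS decide
// which thread wins the next round.
class rw_mutex_write_side_sketch
{
    std::mutex              mut_;
    std::condition_variable gate1_;   // all waiters queue here
    std::condition_variable gate2_;   // admitted writer drains readers
    unsigned                state_ = 0;
    static const unsigned write_entered_ = 1u << 31;
    static const unsigned n_readers_     = ~write_entered_;
public:
    void lock()
    {
        std::unique_lock<std::mutex> lk(mut_);
        while (state_ & write_entered_)   // gate 1: one writer at a time
            gate1_.wait(lk);
        state_ |= write_entered_;         // close gate 1 behind us
        while (state_ & n_readers_)       // gate 2: wait for readers
            gate2_.wait(lk);
    }
    void unlock()
    {
        {
            std::lock_guard<std::mutex> lk(mut_);
            state_ = 0;                   // reopen gate 1
        }
        gate1_.notify_all();  // one notify_all; no choosing readers
                              // over writers or vice versa
    }
    unsigned state() const { return state_; }  // test hook
};
```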
Let's look at a typical (anticipated) use case under high contention
(we're not even discussing or worried about low contention use cases
for this discussion): Few (nw) writers blocked, many (nr) readers
blocked; writer unlocks, and issues notify_all().
Chances are good that one of the readers will get past gate 1 before
one of the writers gets past gate 1, just because there are so many
of them compared to the number of writers (strength in numbers).
Chance first thread past gate 1 is a reader: nr/(nr+nw)
The above formula rapidly approaches 100% as nw/nr approaches 0.
nw/nr    Chance of reader being first
0        100%   // no writers
1.0      50%    // same number of writers as readers
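That arithmetic can be written down as a one-liner (illustrative only):

```cpp
// Chance that the first thread past gate 1 is a reader, given nr
// blocked readers and nw blocked writers, each equally likely to be
// scheduled first: nr / (nr + nw).
double reader_first_chance(double nr, double nw)
{
    return nr / (nr + nw);
}
```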
Indeed it is likely that several, or even many readers will get past
gate 1 before one of the few writers is chosen to try. When one of
the few writers is finally chosen, it too will get past gate 1 and
block at gate 2. At this point you've got many readers inside of
gate 1 getting work done. You have only a few writers (nw-1) that
woke up for nothing and now have to go back to sleep. And (on
average) you'll have relatively few readers that woke up for nothing
and must sleep again (just how many, statistically speaking, drops
with the nw/nr ratio -- probability formula left as an exercise to
the reader :-)).
Of course there will be cases where one of the few writers makes it
past gate 1 at or near the beginning of the thundering herd, in which
case many threads will be woken up for nothing. However,
statistically speaking, and assuming all equal probability for any
thread getting woken up, this scenario should happen infrequently,
and thus not be a performance problem.
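A quick Monte Carlo sanity check of that statistical claim (illustrative only; the fixed seed and the nr/nw values in the test are hypothetical, and "equal probability for any thread" is modeled as a uniform random shuffle):

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Estimate how often a writer is the first thread past gate 1 when
// nr readers and nw writers are woken with equal probability of
// running first. Expected rate is nw / (nr + nw).
double writer_first_rate(int nr, int nw, int trials = 100000)
{
    std::mt19937 gen(42);                    // fixed seed: repeatable demo
    std::vector<int> woken(nr, 0);           // 0 = reader
    woken.insert(woken.end(), nw, 1);        // 1 = writer
    int writer_first = 0;
    for (int t = 0; t < trials; ++t)
    {
        std::shuffle(woken.begin(), woken.end(), gen);
        if (woken.front() == 1)              // herd was woken for nothing
            ++writer_first;
    }
    return static_cast<double>(writer_first) / trials;
}
```

With many readers and few writers (say nr = 100, nw = 2) the wasted-herd case lands near 2%, consistent with "should happen infrequently"; at nr == nw it is a coin flip, matching the gray area discussed below.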
If you have an atypical scenario: Many writers, few readers, then I
agree that the "thundering herd" approach may exhibit performance
problems, as it is then likely that a writer is first, and every
other thread is woken up for nothing. And in this scenario, I
suggest that a read/write mutex is probably not a good design choice
for the application.
There's a big gray area in between, where nr approximately equals nw.
It would be interesting to see a better analysis of that scenario,
and discussion of the applicability of read/write mutexes there.
Another thing I like about this approach is that it is relatively
simple to add the upgradable concept to it. And there is no extra
cost for supporting upgradable. One gets it practically for free.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk