From: Peter Dimov (pdimov_at_[hidden])
Date: 2007-08-23 10:35:11
Howard Hinnant wrote:
> On Aug 22, 2007, at 2:59 PM, Zach Laine wrote:
>> Could the folks who object to the
>> current design spell it out for me a bit more explicitly -- what in
>> the design is dangerous/inconvenient enough to throw away one or more
>> of the 4 goals?
> I would like to see an answer to Zach's question too. I do not know
> what the major objection is with the "current proposal". I only know
> that people are suggesting alternatives.
I deliberately only submitted an alternative for consideration instead of
poking holes in your design or argumentation. But if you insist...
I don't believe that my suggested alternative throws away any of the four
goals, hypothetical malicious vendor D notwithstanding. It is still possible
for the vendor to meet the goals. In addition, it adds
Goal 5: the ability to control the level of checking globally;
Goal 6: ... without source changes or even recompilation.
This can help if hypothetical user D, lured by the "no overhead, fewer L1
cache misses!" slogan, uses unchecked<> as a matter of habit. "Checking is
for other people."
On reflection though, I'll change the constructor from

    explicit condition( Mutex * pm = 0 );

to

    explicit condition( Mutex * pm );

as it's too easy to accidentally disable checking by not including the
condition in the member init list.
Going back to:

    typedef mutex mutex_t;
    typedef condition< unchecked<mutex_t> > cond_t;
and the L1 cache miss argument for:
1. The size of shared_mutex, according to your numbers, is 104. If we assume
16 bytes of state in A, this makes sizeof(A) 120. The addition of two
pointers makes it 128. This is a 7% increase, but it also happens to round
up the size of A to 128, which makes it never straddle a cache line, so the
"more bloated" version will actually be faster. Note that I picked the
number 16 before realizing that. :-) If A had 24 bytes of state the two
pointers would of course be detrimental.
2. I'm having a hard time imagining a program where the L1 cache misses due
to the increased size of A would matter. An object of type A allocates a not
insignificant amount of kernel resources, so it would be hard to keep so
many A's in memory for the +7% L1 misses to show up.
3. What is sizeof(pthread_rwlock_t) on your platform? Is it not something
like 48-56? This is half the size of the above shared_mutex, so users who
are L1 cache miss conscious will not use the shared_mutex anyway.
4. The vector v would need to be protected by its own rwlock as well since
you can't reallocate it while someone is accessing the A's, which creates a
central bottleneck. vector< shared_ptr<A> > does not suffer from this
problem as you can reallocate it while someone is holding a reference to an
element in the form of shared_ptr.
5. If optimizing the L1 cache friendliness of shared_mutex is an important
goal, I would consider moving the conditions to the heap as they aren't
accessed on the fast path.
    void * rest_;
I know that your actual shared_mutex will use atomics on state_, so you
might even be able to make

    unsigned state_; // , state2_?
    void * rest_;
work. Uncontended access now doesn't touch *rest_ and L1 cache misses are a
fraction of what they used to be. Contended access does cause more cache
misses, but this is shadowed by the cost of the contention.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk