Subject: Re: [boost] Request for feedback on design of concurrent_unordered_map, plus notes on use of memory transactions
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2014-07-28 06:08:26


On 28 Jul 2014 at 13:26, Gavin Lambert wrote:

> On 28/07/2014 03:20, Niall Douglas wrote:
> > * item_type is allocated via a rebind of the supplied allocator, and
> > looks like:
> >
> > struct item_type
> > {
> >     value_type p;
> >     size_t hash;
> > };
>
> I presume this will support partial construction to handle the case when
> value_type does not have a default constructor, or is movable-only, etc?

My partial implementation will only support move construction, as
that is all I need to store shared_ptr.

You're right that there is a small object vs large object
optimisation opportunity here.
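For illustration, one way a fuller implementation might support
partial construction (a sketch only; this item_type is hypothetical,
not my actual layout) is to keep raw aligned storage in the slot and
placement-new into it, so value_type need never be default
constructed:

#include <cstddef>
#include <new>
#include <utility>

template<class value_type>
struct item_type
{
    // Raw aligned storage rather than a value_type member, so a slot
    // can sit empty even when value_type has no default constructor.
    alignas(value_type) unsigned char storage[sizeof(value_type)];
    std::size_t hash;   // a magic constant here marks an empty slot

    // Move-construct a value into the slot; caller holds the bucket lock.
    void construct(value_type &&v, std::size_t h)
    {
        new (storage) value_type(std::move(v));
        hash = h;
    }

    value_type &value()
    {
        return *reinterpret_cast<value_type *>(storage);
    }
};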

> > * To insert or find an item involves determining the bucket from the
> > modulus of the hash by the number of buckets, locking the bucket and
> > scanning the linear list for an empty item or the right item as
> > appropriate. We know empty items from a magic hash value which is
> > defined to some magic constant (advice on the best magic hash value
> > least likely to turn up from std::hash is welcome).
>
> Isn't it by definition (admittedly reality is less ideal) that all hash
> values have equal probability? What happens if a real item does end up
> with the magic "empty" hash value? I think it'd be safer to have a
> separate flag for this case.

Firstly, if the magic hash value ever turns up during insert, the
program aborts :) That should, hopefully, be so exceedingly rare on
64 bit that we don't care [1].

Secondly, a separate flag costs another 8 bytes, and it ruins the
cache line symmetry for AFIO, where items size out to 32 bytes, which
packs evenly into cache lines.

[1]: On libstdc++, the hash of a size_t is that size_t itself. As
AFIO increments a size_t from 1 onwards, any magic value with the top
bit set is safe. MSVC tries rather harder: for a size_t it uses the
FNV-1a hash function, which is rated second best overall (but best
for numbers) at
https://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed.
FNV-1a looks pretty random to me.
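To make the sentinel idea concrete, something like this sketch (the
constant and function names are hypothetical, nothing is settled):

#include <cstddef>
#include <cstdlib>
#include <functional>

// Hypothetical sentinel marking an empty slot: top bit set, so the
// identity hash of a small incrementing size_t key never equals it.
static const std::size_t empty_hash =
    std::size_t(1) << (sizeof(std::size_t) * 8 - 1);

template<class Key, class Hash = std::hash<Key>>
std::size_t safe_hash(const Key &k)
{
    std::size_t h = Hash()(k);
    if (h == empty_hash)
        std::abort();   // should be so rare on 64 bit we don't care
    return h;
}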

> > * To erase an item is easy: iterators store an iterator to the bucket
> > and an offset into the linear table. Erasure involves constructing a
> > temporary value_type, locking the bucket and performing a *move*
> > construction out of the table into the temporary value_type,
> > resetting the hash to empty. Destruction can then occur without
> > holding the bucket lock. If the item just erased is the last in the
> > bucket, the bucket's size (not capacity) is reduced by one, and that
> > is repeated until the bucket size is the minimal possible.
>
> If the hash is reset to empty and the lock released before destruction
> occurs, doesn't that mean that something else could try to concurrently
> construct into the same storage? After all, the item slot will now look
> empty to insert().

We have to be careful here about "destruction of what?". The
assumption is that the value_type inside the item_type is left in a
low cost or default state after being moved out of it, so the lock is
held during the move, after which that slot will be available for use
by someone else who may claim the lock and move construct into that
storage.

If someone held a reference to that value, then its contents would
suddenly change in a racy way. But it would still point to an
unrelocated value of that type, and therefore I believe it is safe
for most uses.

As destruction is generally slow (free() etc.), the bucket lock is
not held for it. That makes an enormous difference to concurrency.
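As a sketch of the pattern (shapes hypothetical, and std::mutex
standing in for whatever bucket lock is actually used):

#include <cstddef>
#include <mutex>
#include <utility>

static const std::size_t empty_hash =
    std::size_t(1) << (sizeof(std::size_t) * 8 - 1);

template<class value_type>
struct item_type { value_type p; std::size_t hash; };

struct bucket_type { std::mutex lock; /* items, count, ... */ };

template<class value_type>
void erase_item(bucket_type &b, item_type<value_type> &slot)
{
    std::unique_lock<std::mutex> g(b.lock);
    value_type tmp(std::move(slot.p));   // move out under the lock
    slot.hash = empty_hash;              // slot now looks empty to insert()
    g.unlock();
    // tmp is destroyed at scope exit, after the lock is released, so
    // slow destruction (free() etc.) does not serialise the bucket.
}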

> Similarly the bucket size could be reduced by one
> just as insert() puts something into that bucket slot (thinking that it
> was still within the valid range, so not trying to increment it itself).

The lock is held during all resizes of the bucket.

> > * size() under this design becomes O(bucket count) as we sum the
> > count members of each bucket. empty() is worst case O(bucket count)
> > when the map is empty, best case O(1) when the first bucket is not
> > empty, average case O(bucket count/item count/2).
>
> You could use a relaxed count at the top level to improve this,
> possibly.

I didn't mention the atomic count in each bucket as I thought it
obvious, but each bucket holds an atomic count of its valid entries.
This saves size() and empty() from having to scan each bucket's item
table.
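With that count, size() and empty() look roughly like this (a sketch,
not the actual code):

#include <atomic>
#include <cstddef>
#include <vector>

struct bucket_type
{
    std::atomic<std::size_t> count;  // valid entries in this bucket
    // ... lock, item table, etc.
};

// size() sums the per-bucket counts: O(bucket count), no locks taken.
std::size_t map_size(const std::vector<bucket_type> &buckets)
{
    std::size_t n = 0;
    for (const bucket_type &b : buckets)
        n += b.count.load(std::memory_order_relaxed);
    return n;
}

// empty() can return at the first non-empty bucket: O(1) best case.
bool map_empty(const std::vector<bucket_type> &buckets)
{
    for (const bucket_type &b : buckets)
        if (b.count.load(std::memory_order_relaxed) != 0)
            return false;
    return true;
}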

> size() and empty() become statistical rather than
> prescriptive in the presence of concurrent writes anyway, so a stale
> value is not harmful. Although I suppose conversely updating a shared
> counter will cause cache contention performance loss for insert/erase,
> which may not be worth it for the benefit of statistical functions that
> are less likely to even be called in a concurrent design.

Exactly. It's trivial to keep a top level atomic count, but the
concurrency consequences are dire. I do really wish I knew of a
better way for empty() to run in constant time though :(

> > Thoughts on this design before I write a final implementation are
> > welcome. I don't plan to complete my implementation past the minimum
> > needed for proposed Boost.AFIO, so if someone else wishes to finish
> > it and make it part of official Boost, that would be great.
>
> I presume you've looked at other common implementations of this sort of
> thing? eg. libcds.

I did do quite a bit of prior art research, and was surprised that no
one seems to have had a crack at an STL-compliant
concurrent_unordered_map apart from the Intel TBB and Microsoft PPL
implementations, neither of which allows concurrent erase, which is a
showstopper. I don't have the time to complete my implementation past
what I need now, but I was thinking it would make a great GSoC
project for 2015: Boost (and the STL) is in sore need of a decent
concurrent_unordered_map with a different set of design compromises
from the split ordered list designs.

The libcds implementations don't follow the STL API, and unless I
missed something they get their speed from some clever memory
allocation tricks which require threads to call init and cleanup
routines to initialise thread-local state. I didn't look deeply here,
so if I am wrong please do correct me.

Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/ 
http://ie.linkedin.com/in/nialldouglas/


