Boost logo

Boost :

Subject: Re: [boost] Request for feedback on design of concurrent_unordered_map, plus notes on use of memory transactions
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2014-07-28 18:44:10


On 28 Jul 2014 at 7:21, Rob Stewart wrote:

> The worst case, add I see it, is that you simply increment (or
> decrement) a computed hash value should the magic value ever arise. That
> is, you increase the odds of a collision so that no value ever uses the
> magic value.

That works for me. Great idea, thank you.

Here are some benchmarks for a first attempt at the design I proposed
earlier. The tests involve reserving 10,000 capacity and inserting
5,000 items to try and remove the overhead of rehashing for
unordered_map. The write test is a random burst of up to 128
insertions or deletions which may be done concurrently by as many
threads as the CPU has (8 hyperthreaded ones). The read test is as
many finds as it can do concurrently.

I tested single threaded and 8 threaded to ensure the concurrent
design was not slower than a spinlocked std::unordered_map. First
single threaded:

=== Large unordered_map spinlock write performance ===
1. Achieved 23548585.078085 transactions per second

=== Large unordered_map spinlock read performance ===
1. Achieved 64616938.960940 transactions per second

=== Large concurrent_unordered_map write performance ===
There are 1 threads in this CPU
1. Achieved 36929611.532615 transactions per second

=== Large concurrent_unordered_map read performance ===
There are 1 threads in this CPU
1. Achieved 66784946.368549 transactions per second

As concurrent_unordered_map doesn't have to check the load factor
during inserts and erases, it is quicker single threaded.

Eight threaded:

=== Large unordered_map spinlock write performance ===
1. Achieved 22012179.855850 transactions per second

=== Large unordered_map spinlock read performance ===
1. Achieved 42627903.829194 transactions per second

=== Large concurrent_unordered_map write performance ===
There are 8 threads in this CPU
1. Achieved 166278672.013391 transactions per second

=== Large concurrent_unordered_map read performance ===
There are 8 threads in this CPU
1. Achieved 165071376.646606 transactions per second

The contention on the read spinlock reduces unordered_map performance
over the single threaded case significantly. Meanwhile my
concurrent_unordered_map sees a 4.5x insert/erase improvement and a
2.5x find improvement, in fact both writing and reading the
concurrent_unordered_map are now both the same performance under
maximum contention.

Someone is bound to ask how does this compare to a split ordered list
concurrent_unordered_map? Well, for Microsoft PPL's one using VS2013:

=== Large concurrent_unordered_map read performance ===
There are 4 threads in this CPU
Mine: Achieved 37400530.145442 transactions per second
PPL: Achieved 54401408.200488 transactions per second

So PPL is about 1.5x of mine for reads. PPL can't do concurrent erase
though, so I can't test the writes.

My design is also intended for transactional GCC, and there I see the
following benchmarks:

Single threaded:

=== Large unordered_map spinlock write performance ===
1. Achieved 15925637.896608 transactions per second

=== Large unordered_map spinlock read performance ===
1. Achieved 52702345.715530 transactions per second

=== Large concurrent_unordered_map write performance ===
There are 1 threads in this CPU
1. Achieved 19225268.992226 transactions per second

=== Large concurrent_unordered_map read performance ===
There are 1 threads in this CPU
1. Achieved 18177607.124302 transactions per second

Something is bad here with my single threaded read performance, it's
way low. I'm going to assume it's a bug in transactional GCC, see
below for what I mean.

Eight threaded:

=== Large unordered_map spinlock write performance ===
1. Achieved 13913300.037094 transactions per second

=== Large unordered_map spinlock read performance ===
1. Achieved 32325773.115938 transactions per second

=== Large concurrent_unordered_map write performance ===
There are 8 threads in this CPU
1. Achieved 95011747.620646 transactions per second

=== Large concurrent_unordered_map read performance ===
There are 8 threads in this CPU
1. Achieved 109675855.394830 transactions per second

My concurrent_unordered_map sees a 5x insert/erase improvement and a
6x find improvement. That surely means my code is in fact good and is
parallelising nicely.

Obviously these are early results, I still have much to do. It is
looking like it should provide a nice wee boost for AFIO's
performance for sure.

Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/ 
http://ie.linkedin.com/in/nialldouglas/



Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk