
Subject: Re: [boost] Non-allocating future promise... Re: ASIO into the standard (was: Re: C++ committee meeting report)
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2014-07-08 07:57:09


On 7 Jul 2014 at 1:50, Gottlob Frege wrote:

> https://github.com/boostcon/cppnow_presentations_2013/blob/master/mon/future_promise.pdf?raw=true
>
> 2013. The talk is called "Non-Allocating std::future/promise". I
> think most of it is about a... non-allocating future-promise.
> Hopefully. That was at least the idea.

Dear dear dear ... given that I was working at BlackBerry with you at
the time, and went with you to C++ Now, I really don't know how I
missed this.

I'm going to assume that I didn't miss this and instead choose to
forget and then pretend that your idea was my idea. So, my apologies,
and I'll credit you in the docs when the time comes.

> In brief:
> - each has a pointer to the other
> - the result is stored in the future
> - if one moves or is destroyed, it is... tricky. But basically a CAS
> and a small spin lock in a rare case when you need to wait for the
> other side to update the pointer.

The problem with this is that it isn't space optimal. Space optimal means
storing the spinlock in the low bits of the pointer to the other side
instead of spending another four or eight bytes on it, which is what I've
done. Performance does take a small hit, but not much: it's all on the
same cache line anyway.

I also didn't see lock backoff in your example code, i.e. handling of the
case where the promise and the future are concurrently moved from
different threads. Did I miss something?

> Yes, replacing the spin and pointer updating with TM would be nice.

And here is where things become very interesting.

I started off assuming, as you would, that TM has two big wins:

1. You can update two or more disparate pointers atomically. Very
handy for lock and wait free removal of items from linked lists.

2. You can skip locking where concurrent updating isn't a problem. This
works by taking care to read (volatile) from all the cache lines whose
unlocked modification would corrupt state, and then writing as little as
possible yourself.

Where I made a mistake was in assuming that TM is free. It is not: Intel
TSX HLE is significantly slower in the single threaded case, while TSX
RTM is not free to set up either. Using RTM for a simple two-pointer
update, for example, is slower than a spinlock. And if you compile with
transactional GCC, all your code becomes 40% slower, so you don't win on
performance there either.

Which suggests that TSX really needs a bigger chunk of transaction to
make the overhead worth it, which is why I thought I should write a TSX
concurrent_unordered_map. That, as I learned last night, runs at about
40% of the single threaded performance of a spin locked unordered_map,
which is unacceptable. A split ordered list implementation using cmpxchg
is twice as fast, but at the cost that erase is unsafe.

So back to the drawing board again. I'm now thinking of simplifying
things by requiring that the mapped type be a shared_ptr<T>, and I'll see
what sort of design that yields. So far I am repeatedly finding that TM
is not worth it due to the costs on single threaded performance, far more
often than I expected. Maybe the next Intel chip will improve TSX's
overheads substantially.

Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/ 
http://ie.linkedin.com/in/nialldouglas/



Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk