Subject: Re: [boost] Non-allocating future promise... Re: ASIO into the standard (was: Re: C++ committee meeting report)
From: Lee Clagett (forum_at_[hidden])
Date: 2014-07-08 22:29:16


On Tue, Jul 8, 2014 at 7:57 AM, Niall Douglas <s_sourceforge_at_[hidden]>
wrote:

> On 7 Jul 2014 at 1:50, Gottlob Frege wrote:
>
> >
> https://github.com/boostcon/cppnow_presentations_2013/blob/master/mon/future_promise.pdf?raw=true
> >
> > 2013. The talk is called "Non-Allocating std::future/promise". I
> > think most of it is about a... non-allocating future-promise.
> > Hopefully. That was at least the idea.
>
> Dear dear dear ... given that I was working at BlackBerry with you at
> the time, and went with you to C++ Now, I really don't know how I
> missed this.
>
> I'm going to assume that I didn't miss this and instead choose to
> forget and then pretend that your idea was my idea. So, my apologies,
> and I'll credit you in the docs when the time comes.
>
> > In brief:
> > - each has a pointer to the other
> > - the result is stored in the future
> > - if one moves or is destroyed, it is... tricky. But basically a CAS
> > and a small spin lock in a rare case when you need to wait for the
> > other side to update the pointer.
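>
> In code, I read the scheme as roughly this (my paraphrase, not the
> talk's actual code; names invented, and the CAS/spin synchronisation
> on move/destruction elided):
>
>   #include <atomic>
>   #include <utility>
>
>   template <class T> struct promise;
>
>   template <class T> struct future {
>       std::atomic<promise<T>*> other{nullptr}; // back-pointer to promise
>       T result;                                // value lives in the future
>   };
>
>   template <class T> struct promise {
>       std::atomic<future<T>*> other{nullptr};  // forward pointer to future
>       void set_value(T v) {
>           // Real code must CAS/spin here in case the future is being
>           // concurrently moved or destroyed.
>           other.load()->result = std::move(v);
>       }
>   };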
>
> The problem with this is that it isn't space optimal. Space optimal
> means storing the spinlock in the low bits of the memory pointer
> instead of wasting another four or eight bytes on it, which is what
> I've done. Performance is impacted a bit, but not by much - after
> all, it's all in the same cache line anyway.
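>
> Roughly like this (an illustrative sketch only, not my actual code):
>
>   #include <atomic>
>   #include <cstdint>
>
>   // Bit 0 of the peer pointer doubles as the spinlock; this relies
>   // on the pointee being at least 2-byte aligned, so bit 0 of any
>   // real pointer is always zero.
>   struct locked_ptr {
>       std::atomic<std::uintptr_t> bits{0};
>
>       void* lock() {                      // spin until bit 0 is acquired
>           for (;;) {
>               std::uintptr_t v = bits.load(std::memory_order_relaxed);
>               if (!(v & 1) && bits.compare_exchange_weak(
>                       v, v | 1, std::memory_order_acquire))
>                   return reinterpret_cast<void*>(v);
>           }
>       }
>       void unlock(void* p) {              // publish new pointer, clear bit 0
>           bits.store(reinterpret_cast<std::uintptr_t>(p),
>                      std::memory_order_release);
>       }
>   };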
>
> I also didn't see lock backoff in your example code, i.e. for the
> case where the promise and the future are concurrently moved from
> different threads.
> Did I miss something?
>
> > Yes, replacing the spin and pointer updating with TM would be nice.
>
> And here is where things become very interesting.
>
> I started off assuming, as you would, that TM has two big wins:
>
> 1. You can update two or more disparate pointers atomically
> (sketched below). Very handy for lock- and wait-free removal of
> items from linked lists.
>
> 2. You can skip locking where concurrent updating isn't a problem.
> This works by taking care to volatile read from every cache line
> which, if written to without locks, would corrupt state (pulling
> those lines into the transaction's read set), and then writing as
> little as possible yourself.
>
> Where I have made a mistake is in assuming that TM is free. It is
> not: Intel TSX HLE is significantly slower for the single-threaded
> case, while TSX RTM is not free to set up. For a simple two pointer
> update, for example, using RTM is slower than a spinlock. If you
> compile with transactional GCC, all your code becomes 40% slower,
> so you don't win on performance there either.
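>
> The kind of two pointer update I mean is roughly this (a sketch
> using the RTM intrinsics from <immintrin.h>, compiled with -mrtm;
> not my benchmark code, and the fallback path is simplified):
>
>   #include <immintrin.h>
>   #include <atomic>
>
>   std::atomic<int> fallback_lock{0};
>   void *a, *b;   // two disparate pointers to update atomically
>
>   void update_both(void* na, void* nb) {
>       if (_xbegin() == _XBEGIN_STARTED) {
>           // Read the lock inside the transaction so that a
>           // concurrent lock holder aborts us rather than racing us.
>           if (fallback_lock.load(std::memory_order_relaxed))
>               _xabort(0xff);
>           a = na;
>           b = nb;
>           _xend();
>           return;
>       }
>       // Fallback spinlock; real code would retry the transaction a
>       // few times before giving up on it.
>       while (fallback_lock.exchange(1, std::memory_order_acquire))
>           ;
>       a = na;
>       b = nb;
>       fallback_lock.store(0, std::memory_order_release);
>   }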
>
> Which suggests that TSX really needs a bigger chunk of work per
> transaction to make the overhead worthwhile, which is why I thought
> I should write a TSX concurrent_unordered_map. As I learned last
> night, that runs at about 40% of the single-threaded performance of
> a spin-locked unordered_map, which is unacceptable. A split-ordered
> list implementation using cmpxchg is twice as quick, but at the
> cost that erase is unsafe.
>
> So back to the drawing board again. I'm now thinking of simplifying
> by requiring the mapped type to be a shared_ptr<T>, and I'll see
> what sort of design that yields. So far I am finding, far more
> often than I expected, that using TM is not worth it because of the
> cost to single-threaded performance. Maybe the next Intel chip will
> improve TSX's overheads substantially.
>
> Niall
>
>
I got the impression that writing in a transaction could be the
expensive part, especially if it was contended (having to roll back,
etc.). However, if you entered a critical section only to read, there
should be less of a penalty, since reading never "dirties" a cache
line. Have you tested that case too (lots of readers, few writers)?
Intel's tbb::speculative_spin_rw_lock _really_ makes sure its atomic
flag is on its own cache line (padding everywhere), and acquiring the
reader side doesn't appear to do a write. Although the single-threaded
performance you report has me thinking that I am mistaken; I feel like
a novice despite reading so much about hardware memory barriers.
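
The read path I am imagining is something like this (just a sketch with
the RTM intrinsics, not tbb's actual implementation; real code would
fall back to a plain reader lock after a few aborts):

  #include <immintrin.h>
  #include <atomic>

  // Keep the writer flag alone on its cache line, as tbb's padding
  // does, so unrelated writes don't invalidate readers' read sets.
  struct alignas(64) spec_rw_lock {
      std::atomic<int> writer{0};
  };

  template <class F>
  void speculative_read(spec_rw_lock& l, F&& read_only) {
      while (_xbegin() != _XBEGIN_STARTED)
          ;                        // naive retry; see the caveat above
      if (l.writer.load(std::memory_order_relaxed))
          _xabort(1);              // writer active: abort the speculation
      read_only();                 // pure reads never dirty a cache line
      _xend();
  }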

Lee

