Subject: Re: [boost] Non-allocating future promise... Re: ASIO into the standard (was: Re: C++ committee meeting report)
From: Niall Douglas (s_sourceforge_at_[hidden])
Date: 2014-07-09 06:43:47
On 8 Jul 2014 at 22:29, Lee Clagett wrote:
> > So back to the drawing board again. I'm now thinking of simplifying
> > by requiring that the mapped type is a shared_ptr<T> and I'll see
> > what sort of design that might yield. I am finding that using TM is
> > repeatedly not worth it so far due to the costs on single threaded
> > performance, far more frequently than I expected. Maybe the next
> > Intel chip will improve TSX's overheads substantially.
> I got the impression that writing in a transaction could be the expensive
> part, especially if it was contested (having to rollback, etc).
Aborting an RTM transaction is *very* expensive. I think this is why
HLE appears to use a different internal implementation from RTM.
You also can only touch about 100 cache lines in a transaction before
you have a 50% chance of it aborting regardless, simply from exceeding
internal buffer capacities (half the L1 cache is available for TM,
but it's shared).
> However, if
> you entered a critical section for only reading, there would be less of a
> penalty since it never "dirtied" the cacheline. Have you tested that too
> (lots of readers few writers)? Intel's tbb::speculative_spin_rw_lock
> _really_ makes sure that atomic flag is on its own cacheline (padding
> everywhere), and acquiring the reader doesn't appear to do a write.
I tried a many reader few writer approach (80/20 split) where readers
never write any cache lines. Aborts are even more costly than in a
many writer approach, I assume because the readers have marked more
cache lines as touched by the time a writer collides with them, so
more cache lines must be thrown away. Yeah, I was surprised too.
Putting the fallback atomic flag into its own cacheline is okay as a
one-off for maybe an entire container. Per-bucket it's excessive,
and per-future would be crazy.
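To put rough numbers on that, here is a minimal sketch. It assumes a
64-byte cache line (typical for current x86, but still an assumption),
and the names padded_flag, plain_flag and flag_bytes are hypothetical.
It only compares the storage cost of padding a flag out to a full line
against leaving it unpadded.

```cpp
#include <atomic>
#include <cstddef>

// Assumption: 64-byte cache lines, as on mainstream x86 parts.
constexpr std::size_t cache_line_size = 64;

// Fallback flag padded and aligned so that nothing else shares its line.
struct alignas(cache_line_size) padded_flag {
    std::atomic<bool> locked{false};
};

// The same flag with no padding; neighbours may share its cache line.
struct plain_flag {
    std::atomic<bool> locked{false};
};

// Bytes spent on flags alone if each of `count` objects carries one.
template <class Flag>
constexpr std::size_t flag_bytes(std::size_t count) {
    return count * sizeof(Flag);
}

static_assert(sizeof(padded_flag) == cache_line_size,
              "padded out to one full line");
static_assert(alignof(padded_flag) == cache_line_size,
              "starts on a line boundary");
```

One padded_flag per container costs 64 bytes in total, whereas
flag_bytes<padded_flag>(1000000) for a million buckets comes to
64,000,000 bytes before any payload is stored.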
BTW my RTM-enhanced spinlock doesn't acquire the spinlock but instead
starts a transaction which will abort if someone does acquire the
spinlock. That way all users of the spinlock run the critical section
without actually locking the spinlock. I did this because an HLE
enhanced spinlock is so slow in single threaded code, whereas an RTM
enhanced spinlock had acceptable single threaded performance costs
(~3%).
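The idea can be sketched roughly as follows. This is not the actual
implementation, only an illustration under stated assumptions: the
names elided_spinlock, critical and rtm_usable are hypothetical, the
retry count of 3 is arbitrary, and the RTM path is only compiled when
the compiler defines __RTM__ (e.g. GCC or clang with -mrtm) and only
taken when CPUID reports RTM. Otherwise it degrades to a plain
spinlock.

```cpp
#include <atomic>
#if defined(__RTM__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#include <immintrin.h>
#define SKETCH_HAVE_RTM 1
#else
#define SKETCH_HAVE_RTM 0
#endif

// Hypothetical sketch of a spinlock whose users normally never lock it.
struct elided_spinlock {
    std::atomic<bool> locked{false};

    // CPUID leaf 7, subleaf 0, EBX bit 11 reports RTM support.
    static bool rtm_usable() {
#if SKETCH_HAVE_RTM
        unsigned a = 0, b = 0, c = 0, d = 0;
        return __get_cpuid_count(7, 0, &a, &b, &c, &d) && (b & (1u << 11));
#else
        return false;
#endif
    }

    // Real acquisition, used only as the fallback path.
    void acquire() {
        while (locked.exchange(true, std::memory_order_acquire)) {
            while (locked.load(std::memory_order_relaxed)) {
            }
        }
    }
    void release() { locked.store(false, std::memory_order_release); }

    // Runs f() as a critical section, transactionally when possible.
    template <class F>
    void critical(F&& f) {
#if SKETCH_HAVE_RTM
        static const bool usable = rtm_usable();
        if (usable) {
            for (int attempt = 0; attempt < 3; ++attempt) {
                if (_xbegin() == _XBEGIN_STARTED) {
                    // Reading the flag adds it to the read set, so a
                    // thread that really takes the lock aborts us.
                    if (locked.load(std::memory_order_relaxed))
                        _xabort(0xff);
                    f();
                    _xend();
                    return;
                }
            }
        }
#endif
        // Fallback: no RTM, or the transaction kept aborting.
        acquire();
        f();
        release();
    }
};
```

Usage would look like lock.critical([&] { ++counter; }); on the
transactional path the flag is only ever read, so it is written only
when some thread drops to the fallback.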
> Although, the single threaded performance has me thinking that I am
> mistaken; I feel like a novice despite reading so much about hardware
> memory barriers.
Well, do bear in mind this stuff isn't my forte. I could simply be
incompetent. It doesn't help I'm working on this stuff after a full
day of work, so my brain is pretty tired. I'm sure when someone like
Andrey gets onto this stuff he'll see much better results than I have.
--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/