Subject: Re: [boost] [lockfree] Review
From: Helge Bahmann (hcb_at_[hidden])
Date: 2011-08-08 06:41:33


On Monday 08 August 2011 10:59:47 Grund, Holger wrote:
> > > Agreed, this is not impossible, but I still tend to think we should
> > > strive for a more efficient implementation if at all possible.
> >
> > Where do you see room for improvement? It is a fallacy to assume that
> > "most efficient implementation" always means "there is a machine
> > instruction providing a 1:1 translation of my high-level construct".
> > Look at this from the POV of cache synchronisation cost (which is the
> > real cost, not the number of instructions), and you will realize that
> > there is not much you can do (assuming you can squeeze the data copies
> > as well as the sequence counter into the same cacheline).
> >
> > This approach BTW is already way faster than e.g. using a 64-bit mmx
> > register and paying the cost of mmx->gpr transfers on x86.
>
> That doesn't match my experience. Even in the noncontended case, I would be
> very surprised to see anything "way faster".

mmx -> gpr has quite significant latency. Don't forget that you need to
shuffle around a fair bit, plus there is the cost of the eventual emms.

Also don't forget that moving mmx -> gpr defeats a large portion of the CPU's
out-of-order and speculative execution capability: CPUs cannot in general
track dependencies across different register classes.

> However, under any kind of contention I do expect the MMX MOVQ version to
> be significantly faster.

Assuming that you manage to put everything into a single cache line, I doubt
that you will see any difference at all under contention: the real cost is
the cache line transfer, and transferring a single one has a latency of ~150
cycles (and that is unlikely to decrease with modern CPUs). After that, the
number of bytes read out of the cache line is basically not measurable
anymore.

> And of course, 64 bits is less than 32 + 2 * 64.

It would rather be 32 + 64 + 32 (there is no point in reading the "inactive"
copy, but the processor can certainly read it speculatively).
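
To make the 32 + 64 + 32 count concrete, here is a minimal sketch of one way
such a scheme could look: a 32-bit sequence counter plus two copies of the
64-bit value, packed into one cache line. It is an illustration only, not
code from the library under review. The name DoubleBuffered64, the 64-byte
line size, the single-writer restriction and the reading of the two 32-bit
terms as "counter, then counter again" are all assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical illustration, not Boost.Lockfree code.
// Assumptions: 64-byte cache line, exactly one writer thread, and fewer
// than 2^32 stores during any single load (counter wrap-around is ignored).
struct alignas(64) DoubleBuffered64 {
    std::atomic<std::uint32_t> seq{0};           // low bit selects the active copy
    std::atomic<std::uint32_t> lo[2]{{0}, {0}};  // low 32 bits of each copy
    std::atomic<std::uint32_t> hi[2]{{0}, {0}};  // high 32 bits of each copy

    // Single writer only: fill the inactive copy, then publish it.
    void store(std::uint64_t v) {
        const std::uint32_t s = seq.load(std::memory_order_relaxed);
        const unsigned idx = (s + 1) & 1u;       // the currently inactive copy
        // Keep the previous counter update ordered before the data writes,
        // so a reader that sees new data also sees a changed counter.
        std::atomic_thread_fence(std::memory_order_release);
        lo[idx].store(static_cast<std::uint32_t>(v), std::memory_order_relaxed);
        hi[idx].store(static_cast<std::uint32_t>(v >> 32), std::memory_order_relaxed);
        seq.store(s + 1, std::memory_order_release);
    }

    // Any number of readers: 32 bits (counter) + 64 bits (active copy)
    // + 32 bits (counter again); retry if the counter moved in between.
    std::uint64_t load() const {
        for (;;) {
            const std::uint32_t s1 = seq.load(std::memory_order_acquire);
            const unsigned idx = s1 & 1u;
            const std::uint32_t l = lo[idx].load(std::memory_order_relaxed);
            const std::uint32_t h = hi[idx].load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            const std::uint32_t s2 = seq.load(std::memory_order_relaxed);
            if (s1 == s2)
                return (static_cast<std::uint64_t>(h) << 32) | l;
        }
    }
};

// Counter and both copies share one (assumed 64-byte) cache line.
static_assert(sizeof(DoubleBuffered64) == 64, "expected exactly one cache line");

// Single-threaded round trip, only to show the intended usage.
inline void demo() {
    DoubleBuffered64 v;
    assert(v.load() == 0);
    v.store(0x0123456789abcdefULL);
    assert(v.load() == 0x0123456789abcdefULL);
    v.store(42);
    assert(v.load() == 42);
}
```

The reader never touches the inactive copy, and everything it does touch
lives in the same cache line, so in this sketch one line transfer covers the
whole read.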

Helge

