Subject: Re: [boost] [atomic] (op)_and_test naming
From: Andrey Semashev (andrey.semashev_at_[hidden])
Date: 2018-01-26 14:45:40
On 01/26/18 16:32, Peter Dimov via Boost wrote:
> Andrey Semashev wrote:
>>> While we're on the subject, on what architectures would opaque_sub
>>> be more efficient than sub_and_test?
>> On x86 and gcc < 7 opaque_sub allows using "lock sub" or "lock dec"
>> without setting the bool according to the zero flag, i.e. it saves a
>> register and an instruction.
> Right, thanks. I was thinking that testing for zero comes for free, but
> it's not (entirely) free for the reason you give. Does this actually
> matter in practice? I would expect the atomic to dominate the `set(n)z al`.
Latency-wise, I expect that to be mostly true. But wasting a register
may be undesirable if it causes a spill on the stack somewhere in the
surrounding code, especially if this is a tight loop. In any case, I
just want to be able to generate the best possible code with the
interface atomic<> provides.
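
To make the comparison concrete, here is a minimal sketch of the three call shapes under discussion, written against std::atomic rather than Boost.Atomic. The names opaque_sub_sketch, sub_and_test_sketch and three_shapes_demo are hypothetical, and the convention that the test returns true when the result is zero is an assumption made only for this sketch. A portable wrapper like this cannot force a bare "lock sub"; it only shows how much of the result each interface asks the compiler to keep.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: subtract and discard the result entirely.
// Nothing is returned, so no register is needed for a result.
inline void opaque_sub_sketch(std::atomic<int>& a, int v)
{
    (void)a.fetch_sub(v, std::memory_order_acq_rel);
}

// Hypothetical helper: subtract and report whether the new value is zero.
// Assumption for this sketch: "true" means the result became zero.
inline bool sub_and_test_sketch(std::atomic<int>& a, int v)
{
    return a.fetch_sub(v, std::memory_order_acq_rel) - v == 0;
}

// Exercises the three shapes on one counter and returns its final value.
inline int three_shapes_demo()
{
    std::atomic<int> counter(3);
    int old_value = counter.fetch_sub(1, std::memory_order_acq_rel); // old value needed
    opaque_sub_sketch(counter, 1);                                   // result not needed
    bool reached_zero = sub_and_test_sketch(counter, 1);             // only "is it zero?" needed
    assert(old_value == 3);
    assert(reached_zero);
    return counter.load(std::memory_order_relaxed);
}
```

In the opaque case there is no bool for the compiler to materialize, which is the register and instruction saving described above.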
>> Gcc 7 introduced the ability to return flags from the asm statement,
>> so the code can be written the same way. Although I noticed that the
>> compiler tends to save the flag into a register early unless it is
>> tested immediately, so in some cases opaque_sub might still be
>> preferable where it suits.
> Don't see how opaque_sub could be preferable if you need to test the
> flag later. :-)
Of course. :) I meant, in the case where you don't need the result,
opaque_sub is still preferable to fetch_sub or sub_and_test.
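
For reference, a rough sketch of what returning the flag from the asm statement can look like, assuming GCC or Clang on x86 with flag output operands available (detected here through the __GCC_ASM_FLAG_OUTPUTS__ macro). The function names are hypothetical, the "true means zero" convention is an assumption for this sketch, and the fallback branch uses the __atomic_sub_fetch builtin; this is not the Boost.Atomic implementation.

```cpp
#include <cassert>

// Atomically subtract v from *p and return true if the result is zero.
// Hypothetical name; "true means zero" is an assumption of this sketch.
inline bool sub_and_test_zero_sketch(int* p, int v)
{
#if defined(__GCC_ASM_FLAG_OUTPUTS__) && (defined(__x86_64__) || defined(__i386__))
    bool zero;
    // "=@ccz" lets the compiler read the zero flag directly after the asm,
    // so no explicit "setz" has to be written in the asm body.
    __asm__ __volatile__
    (
        "lock; subl %[arg], %[storage]"
        : [storage] "+m" (*p), [zero] "=@ccz" (zero)
        : [arg] "ir" (v)
        : "memory"
    );
    return zero;
#else
    // Fallback for other targets: compiler builtin, explicit comparison.
    return __atomic_sub_fetch(p, v, __ATOMIC_SEQ_CST) == 0;
#endif
}

// Drops a count of 2 to zero in two steps and returns the final value.
inline int flag_output_demo()
{
    int refs = 2;
    bool first = sub_and_test_zero_sketch(&refs, 1);  // 2 -> 1, not zero
    bool second = sub_and_test_zero_sketch(&refs, 1); // 1 -> 0, zero
    assert(!first);
    assert(second);
    return refs;
}
```

Whether the flag is then consumed directly by a conditional jump or first saved into a register is up to the compiler, which is the behaviour noted above.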
> Presumably, if you just call the function and discard
> the return value - the equivalent of opaque_ - the compiler would be
> smart enough to not save the flag.
Hopefully, but I wouldn't bet on it. I've seen gcc generate "setz" then
a couple of "movs" which were moved from god knows where and then "test"
and a conditional jump. Clearly, "movs" don't alter flags, so the spill
and the test are useless.
Admittedly, dropping "setz" when the result is unused is a different
kind of optimization. But my point is that optimizations like these are
generally unreliable, and if you really want to have the best possible
code then you had better write it in a way that gives the compiler less
opportunity to screw up.
> I remember some compilers being smart enough to notice that you don't
> use the result of the atomic fetch_op intrinsic and generating the `lock
> op` themselves, without a separate opaque_op being needed. We can't do
> that on the library level, of course.
Yes, I've seen gcc 7 (and maybe 6?) do that on occasion, but it seemed
that it didn't always do that either. I didn't investigate closely
enough to find out why it didn't always optimize.
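
For anyone who wants to check this on their own compiler, a small sketch along these lines can be compiled with optimization and the generated assembly inspected (for example with g++ -O2 -S). The function names are hypothetical and std::atomic is used instead of Boost.Atomic. Whether the first function becomes a bare "lock sub" depends on the compiler and its version.

```cpp
#include <atomic>
#include <cassert>

// The result of fetch_sub is discarded: a compiler may emit "lock sub" here,
// but that is an optimization, not a guarantee.
inline void release_unused_result(std::atomic<int>& a)
{
    a.fetch_sub(1, std::memory_order_relaxed);
}

// The old value is returned, so the compiler has to materialize it
// (typically with "lock xadd" on x86).
inline int release_used_result(std::atomic<int>& a)
{
    return a.fetch_sub(1, std::memory_order_relaxed);
}

// Both functions change the counter identically; only the returned
// information differs. Returns the final counter value.
inline int discard_demo()
{
    std::atomic<int> counter(2);
    release_unused_result(counter);
    int old_value = release_used_result(counter);
    assert(old_value == 1);
    return counter.load(std::memory_order_relaxed);
}
```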