Boost logo

Boost :

Subject: Re: [boost] non-branching min()
From: Topher Cooper (topher_at_[hidden])
Date: 2009-09-10 20:39:50


You might find some sections of Sean Aaron Anderson's "Bit-Twiddling
Hacks" of interest, though perhaps not of practical use. Check out:

Compute the minimum (min) or maximum (max) of two integers without
branching
<http://graphics.stanford.edu/%7Eseander/bithacks.html#IntegerMinOrMax>
Conditionally set or clear bits without branching
<http://graphics.stanford.edu/%7Eseander/bithacks.html#ConditionalSetOrClearBitsWithoutBranching>
Conditionally negate a value without branching
<http://graphics.stanford.edu/%7Eseander/bithacks.html#ConditionalNegate>

Its worth noting that the better optimization your compiler has the
better the loop in your test case will do on your branchless method.
But a good optimizing compiler would probably avoid the branch in the
first place.

To use the technique to its fullest you or the optimizer should
translate your loop into something like (untested):

      type a[2] = {BIG_VALUE, ANY_VALUE};
      type* vp[2] = {--&v1[0], --&v2[0]}; // I know, but I'm expressing some possibly low level optimization.
      for (size_t i = 0; i<n; ++i)
      {
        a[1] = *vp[*++vp[1]<*++vp[0]];
        a[0] = a[a[1]<a[0]];
      }

(My apologies if I did something dumb -- my work situation has resulted in my not doing much C++ for a number of years -- but you should get the idea).

A good optimizing compiler should be able to do most or all of this from the inlined method, though the "relaxation" of the transfer of the move to "c[]" into an indirection as vp is pretty sophisticated. Binding the local a and the temporary containing the min of the v's to the vector they would otherwise be transfered to is routine.

Of course, if this were code you really wanted to optimize, rather than a test loop, you would improve locality by processing each vector separately by making c[] a reference

Topher

DE wrote:
> i'd like to present some kind of micro optimization towards pipelined
> cpus
>
> traditionaly min() has an implementation like
>
> template<typename type> inline
> type min(type a, type b) { return a<b ? a : b; }
>
> this function (even if inlined) introduces a branch in a sequence of
> instructions
> though branch prediction algorithms do exist they will guess at most
> 50% of them because routine input is arbitrary
> since a wrong branch prediction causes pipeline stall i thought that
> such a useful routine like min() MUST have non-stalling (read
> 'non-branching') implementation
> i thought hard and in the end recalled a simple implementation which
> is called LUT (for 'look-up table')
> here's what i finally got:
>
> template<typename type> inline
> type min(type a, type b)
> {
> const type c[2] = {a, b};
> return c[b<a];
> }
>
> intuitively this implementation should work slower than the first one
> but in real life it runs faster (roughly by a factor of 2 for 4 byte
> types on core2duo)
> here's a test code:
>
> #include <vector>
> int main()
> {
> typedef int type;
> const int n = 1000000;
> std::vector<type> v1(n), v2(n);
> //filling v1 and v2 somehow
> type min = BIG_VALUE;
> for (size_t i = 0; i<n; ++i)
> a = min(a, min(v1[i], v2[i])); //searching minima of v1 and v2
> return 0;
> }
>
> however for 'double' it runs ~2% slower (i think it's because creating
> a lut for 2 elements 8 bytes each is slower than a branch on c2d)
> nevertheless i think it's a great approach at least for ints
> also max() can implemented very similar
>
> you may skip the rest
> example of usage
> imagine we want to implement a symmetric matrix that stores only half
> of its elements (say upper triangle)
>
> a00 a01 a02
> --- a11 a12
> --- --- a22
>
> then access requests to the lower triangle elements would be
> redirected to upper triangle ones
> here we have a constraint that actually we only have access to
> elements a(i, j) where i<=j must always be true (look at
> indices above)
> if we use a branch like
>
> //...
> return i<=j ? a(i, j) : a(j, i);
>
> we get a pipeline stall almost every time we want to access an
> arbitrary element
> however if we rewrite it carefully as
>
> //...
> const size_t m = min(i, j); //non-branching min() here
> return a(m, m + abs(int(i) - int(j))); //abs() is also
> //non-branching
> we get rid of branches at all! and we can hope a pipelined cpu will
> handle it better than the other (branched) implementation
>
> --
> pavel
>
> _______________________________________________
> Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
>
>
>


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk