Boost logo

Boost :

Subject: [boost] non-branching min()
From: DE (satan66613_at_[hidden])
Date: 2009-09-10 13:17:46


i'd like to present some kind of micro optimization towards pipelined
cpus

traditionaly min() has an implementation like

    template<typename type> inline
    type min(type a, type b) { return a<b ? a : b; }

this function (even if inlined) introduces a branch in a sequence of
instructions
though branch prediction algorithms do exist they will guess at most
50% of them because routine input is arbitrary
since a wrong branch prediction causes pipeline stall i thought that
such a useful routine like min() MUST have non-stalling (read
'non-branching') implementation
i thought hard and in the end recalled a simple implementation which
is called LUT (for 'look-up table')
here's what i finally got:

    template<typename type> inline
    type min(type a, type b)
    {
      const type c[2] = {a, b};
      return c[b<a];
    }

intuitively this implementation should work slower than the first one
but in real life it runs faster (roughly by a factor of 2 for 4 byte
types on core2duo)
here's a test code:

    #include <vector>
    int main()
    {
      typedef int type;
      const int n = 1000000;
      std::vector<type> v1(n), v2(n);
      //filling v1 and v2 somehow
      type min = BIG_VALUE;
      for (size_t i = 0; i<n; ++i)
        a = min(a, min(v1[i], v2[i])); //searching minima of v1 and v2
      return 0;
    }

however for 'double' it runs ~2% slower (i think it's because creating
a lut for 2 elements 8 bytes each is slower than a branch on c2d)
nevertheless i think it's a great approach at least for ints
also max() can implemented very similar

you may skip the rest
example of usage
imagine we want to implement a symmetric matrix that stores only half
of its elements (say upper triangle)

    a00 a01 a02
    --- a11 a12
    --- --- a22

then access requests to the lower triangle elements would be
redirected to upper triangle ones
here we have a constraint that actually we only have access to
elements a(i, j) where i<=j must always be true (look at
indices above)
if we use a branch like

    //...
    return i<=j ? a(i, j) : a(j, i);

we get a pipeline stall almost every time we want to access an
arbitrary element
however if we rewrite it carefully as

    //...
    const size_t m = min(i, j); //non-branching min() here
    return a(m, m + abs(int(i) - int(j))); //abs() is also
                                            //non-branching
we get rid of branches at all! and we can hope a pipelined cpu will
handle it better than the other (branched) implementation
                                            

--
pavel

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk