
Boost : 
Subject: [boost] nonbranching min()
From: DE (satan66613_at_[hidden])
Date: 20090910 13:17:46
i'd like to present some kind of micro optimization towards pipelined
cpus
traditionaly min() has an implementation like
template<typename type> inline
type min(type a, type b) { return a<b ? a : b; }
this function (even if inlined) introduces a branch in a sequence of
instructions
though branch prediction algorithms do exist they will guess at most
50% of them because routine input is arbitrary
since a wrong branch prediction causes pipeline stall i thought that
such a useful routine like min() MUST have nonstalling (read
'nonbranching') implementation
i thought hard and in the end recalled a simple implementation which
is called LUT (for 'lookup table')
here's what i finally got:
template<typename type> inline
type min(type a, type b)
{
const type c[2] = {a, b};
return c[b<a];
}
intuitively this implementation should work slower than the first one
but in real life it runs faster (roughly by a factor of 2 for 4 byte
types on core2duo)
here's a test code:
#include <vector>
int main()
{
typedef int type;
const int n = 1000000;
std::vector<type> v1(n), v2(n);
//filling v1 and v2 somehow
type min = BIG_VALUE;
for (size_t i = 0; i<n; ++i)
a = min(a, min(v1[i], v2[i])); //searching minima of v1 and v2
return 0;
}
however for 'double' it runs ~2% slower (i think it's because creating
a lut for 2 elements 8 bytes each is slower than a branch on c2d)
nevertheless i think it's a great approach at least for ints
also max() can implemented very similar
you may skip the rest
example of usage
imagine we want to implement a symmetric matrix that stores only half
of its elements (say upper triangle)
a00 a01 a02
 a11 a12
  a22
then access requests to the lower triangle elements would be
redirected to upper triangle ones
here we have a constraint that actually we only have access to
elements a(i, j) where i<=j must always be true (look at
indices above)
if we use a branch like
//...
return i<=j ? a(i, j) : a(j, i);
we get a pipeline stall almost every time we want to access an
arbitrary element
however if we rewrite it carefully as
//...
const size_t m = min(i, j); //nonbranching min() here
return a(m, m + abs(int(i)  int(j))); //abs() is also
//nonbranching
we get rid of branches at all! and we can hope a pipelined cpu will
handle it better than the other (branched) implementation
 pavel
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk