|
Boost : |
Subject: [boost] non-branching min()
From: DE (satan66613_at_[hidden])
Date: 2009-09-10 13:17:46
i'd like to present some kind of micro optimization towards pipelined
cpus
traditionaly min() has an implementation like
template<typename type> inline
type min(type a, type b) { return a<b ? a : b; }
this function (even if inlined) introduces a branch in a sequence of
instructions
though branch prediction algorithms do exist they will guess at most
50% of them because routine input is arbitrary
since a wrong branch prediction causes pipeline stall i thought that
such a useful routine like min() MUST have non-stalling (read
'non-branching') implementation
i thought hard and in the end recalled a simple implementation which
is called LUT (for 'look-up table')
here's what i finally got:
template<typename type> inline
type min(type a, type b)
{
const type c[2] = {a, b};
return c[b<a];
}
intuitively this implementation should work slower than the first one
but in real life it runs faster (roughly by a factor of 2 for 4 byte
types on core2duo)
here's a test code:
#include <vector>
int main()
{
typedef int type;
const int n = 1000000;
std::vector<type> v1(n), v2(n);
//filling v1 and v2 somehow
type min = BIG_VALUE;
for (size_t i = 0; i<n; ++i)
a = min(a, min(v1[i], v2[i])); //searching minima of v1 and v2
return 0;
}
however for 'double' it runs ~2% slower (i think it's because creating
a lut for 2 elements 8 bytes each is slower than a branch on c2d)
nevertheless i think it's a great approach at least for ints
also max() can implemented very similar
you may skip the rest
example of usage
imagine we want to implement a symmetric matrix that stores only half
of its elements (say upper triangle)
a00 a01 a02
--- a11 a12
--- --- a22
then access requests to the lower triangle elements would be
redirected to upper triangle ones
here we have a constraint that actually we only have access to
elements a(i, j) where i<=j must always be true (look at
indices above)
if we use a branch like
//...
return i<=j ? a(i, j) : a(j, i);
we get a pipeline stall almost every time we want to access an
arbitrary element
however if we rewrite it carefully as
//...
const size_t m = min(i, j); //non-branching min() here
return a(m, m + abs(int(i) - int(j))); //abs() is also
//non-branching
we get rid of branches at all! and we can hope a pipelined cpu will
handle it better than the other (branched) implementation
-- pavel
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk