From: Gennaro Prota (gennaro_prota_at_[hidden])
Date: 2002-07-31 15:53:17


On Wed, 31 Jul 2002 11:00:00 -0500, Rene Rivera
<grafik666_at_[hidden]> wrote:

>[2002-07-31] Anatoliy Kuznetsov wrote:
>
>>
>>>
>>> As Anatoliy pointed out it's only faster when
>>> extended to more bits in
>>> parallel. But I've found that it's faster in the 32
>>> bit case and up.
>>
>>
>>Hmm, I need to reevaluate my test. Thanks for pointing
>>out.
>>
>>
>>> The speedup is mostly due to keeping the computation
>>> in the ALU, as opposed
>>> to doing memory access.
>>
>>
>>I got an impression that in my case all table goes
>>into the CPU cache which makes memory latency not a
>>big problem.
>
>True, if you are using small tables, and if you don't need that same cache
>space for other things. But I usually end up trying to free up as much cache
>space as possible to make room for algorithms where using tables gives you a
>100x speedup. I'd rather have the 100x speedup and worsen the speedups of
>10x or less.
>
>
>>> But if you are forced to do it a byte at a time, the
>>> table would be faster.
>>
>>
>>Also there is a problem of misalignment. If the bitset is
>>character-based it is not always safe to address it as
>>int*. On non-Intel architectures it can cause a bus
>>error.
>>
>>I got a nice crash on Sparc Solaris when I tried to
>>access a stack-allocated array of ints as an array of
>>64-bit longs. The same trick always worked on SGI.
>
>One big problem with using byte access is that on some CPUs the access causes
>pipeline hits. For example, in PPC byte access is optimized to be equivalent
>to regular word access. But in x86 byte access is much slower than word
>access. I ran into this while writing an alpha-blending routine; I ended up
>writing a switch statement with more cases and more code (6x the code)
>that read a word at a time. Just the reading alone was 16x faster this way. If
>you look at implementations of memcpy you'll see the same thing done...
>doing aligned whole-word operations as much as possible.

My implementation makes no claim to be the fastest possible for any
existing C++ implementation/target machine. Of the roughly 30 methods
to compute a bit count that I know of (and certainly there are
others), using a table is IMHO one of the best for its average
performance across different platforms and for its clarity.
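
(For illustration only, here is a rough sketch of the table idea,
assuming 8-bit bytes; it is not necessarily the exact code of my
implementation:)

  int table_count(unsigned long n)
  {
    // 8-bit lookup table, filled on first use via the recurrence
    // table[i] = table[i/2] + (i & 1); it could just as well be a
    // precomputed static array.
    static unsigned char table[256];
    static bool filled = false;
    if (!filled) {
      for (int i = 1; i < 256; ++i)
        table[i] = static_cast<unsigned char>(table[i >> 1] + (i & 1));
      filled = true;
    }

    int count = 0;
    for (; n != 0; n >>= 8)     // one byte at a time
      count += table[n & 0xff];
    return count;
  }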

If you were writing the <bitset> header for a given compiler on a
given platform, let's say Intel C++ on Win32, you could bit-twiddle
the implementation as much as you wanted. You could even write it in
assembly if you wished. The Boost library, instead, has to run on many
different combinations of compilers/machines. And since there is no
single implementation that works equally well everywhere (well, this
is not Java, right? :-)) we have a problem. As I said, many options
are possible. In particular:

- extremist approach: writing 13/14 different implementations and
selecting the appropriate one at compile-time depending on the
platform. I find this excessive.

It seems to me like maniacally choosing how to compute the value of pi
between

  4*atan(1), 2*asin(1) or acos(-1)

depending on which CRT library you are using. Certainly I wouldn't
want to entangle the code with something like:

#if defined (CRT_IMPL1)
   const double pi = 4*atan(1);
#elif defined (CRT_IMPL2)
   const double pi = 2*asin(1);
#elif...
   ....
#endif

Analogously, I wouldn't do something similar for the bit count.

- reasonable approach (in my head :-)): choosing (besides the current
one) an implementation that performs well on most of Boost's target
platforms *but* giving the user the chance to provide his own
bit-twiddled version.

As to "performs well on most of Boost's target platforms", I don't
have access to all of them. So I would be happy if someone provided
the results of the tests (see my message with attachment) for
non-Win32 machines. Anyone listening? :-)

For the user-provided version one can use any technique suitable for
his own needs, ranging from HACKMEM 169 to a Hamming-weight routine
written in assembly (or even a dedicated CPU instruction).

This may seem ugly at first but, as we know, there's no silver bullet.
Too many things come into play in determining code performance.
Let's say you use the classical HACKMEM 169 implementation:

  // HACKMEM 169: computes the bit count of each octal digit in
  // parallel, then adds adjacent counts and sums the resulting
  // 6-bit groups with the '% 63'.  (Assumes a 32-bit unsigned long.)
  int hackmem_count (unsigned long n)
  {
    unsigned long tmp
      = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);

    return ((tmp + (tmp >> 3)) & 030707070707) % 63;
  }

What if your machine doesn't have a suitable div opcode to translate
the '% 63'?
In that case, it _may_ be that manually coding the above as

  int hackmem_manualmodulo (unsigned long n)
  {
    unsigned long tmp
       = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
    tmp = (tmp + (tmp >> 3)) & 030707070707;
    // replace the '% 63' with repeated folding of base-64 digits
    while (tmp > 63)
      tmp = (tmp & 63) + (tmp >> 6);
    return tmp;
  }

turns out to be faster. And it _may_ be that on your compiler/machine
combination manually unrolling the loop speeds it up further:

  int hackmem_moduloloop_unrolled (unsigned long n)
  {
    unsigned long tmp
       = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
    tmp = (tmp + (tmp >> 3)) & 030707070707;
    // same as above, but with the folding loop unrolled by hand
    tmp = (tmp & 63) + (tmp >> 6);
    tmp = (tmp & 63) + (tmp >> 6);
    tmp = (tmp & 63) + (tmp >> 6);
    tmp = (tmp & 63) + (tmp >> 6);
    tmp = (tmp & 63) + (tmp >> 6);
    return tmp;
  }

There's one truth here: you can't know in general; you have to test.
For instance, even in the case above, nothing prevents the compiler,
which knows the assembly of your machine, from translating the first
version as the second one, or all three versions the same way. If your
compiler is smart enough, it could even recognize the following

 int count(unsigned char n) {

    // clears the lowest set bit on each iteration,
    // so the loop runs once per set bit
    int i = 0;
    int num = n;
    while (num) {
        num &= (num - 1);
        ++i;
    }
    return i;

 }

and translate it with a dedicated machine instruction. (One could say
it would have to be a genius to do the same with the
hackmem_moduloloop_unrolled version, don't you think? :-))

In short, if you want to squeeze out the last CPU cycle there's one
thing you can do: prepare a battery of tests with all the techniques
for bit counting you know of and run them on your machine. Use the
compiler you will use for your project; results depend on it (version
and command-line switches included), and keep in mind that on a
slightly different machine (for instance, two Wintel boxes with
different CPUs) the results can differ. They can also change when you
upgrade your compiler to the next version.
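
(Just as a sketch, a bare-bones harness along these lines, with
hypothetical names and only the standard <ctime> facilities, is
usually enough for a first ranking; clock() is coarse, so the
iteration count must be large:)

  #include <cstdio>
  #include <ctime>

  // Assumes one of the candidates above, e.g. hackmem_count, is in scope.
  int main()
  {
    const unsigned long iterations = 10000000;
    volatile int sink = 0;   // keeps the optimizer from dropping the loop

    std::clock_t start = std::clock();
    for (unsigned long i = 0; i < iterations; ++i)
      sink += hackmem_count(i);
    std::clock_t stop = std::clock();

    std::printf("%f seconds\n", double(stop - start) / CLOCKS_PER_SEC);
    return 0;
  }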

Maybe we could add a new template parameter that encapsulates the
bit-count logic, and provide a general-purpose implementation as its
default value. But I haven't tried this yet and I'm really open to
suggestions :-)
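
(Roughly, and with made-up names, something along these lines; the
user could then plug in his own bit-twiddled or assembly-based counter
as the extra parameter:)

  #include <climits>   // CHAR_BIT
  #include <cstddef>   // std::size_t

  // Sketch only: the counting policy is a template parameter with a
  // portable default, so a user who knows his platform can replace it.
  struct default_bit_counter
  {
    static int count(unsigned long n)
    {
      int c = 0;
      for (; n != 0; n >>= 1)
        c += static_cast<int>(n & 1);
      return c;
    }
  };

  template <std::size_t N, typename Counter = default_bit_counter>
  class bitset_sketch   // assumes N > 0
  {
  public:
    bitset_sketch()
    {
      for (std::size_t i = 0; i < num_words; ++i)
        words_[i] = 0;
    }

    std::size_t count() const
    {
      std::size_t result = 0;
      for (std::size_t i = 0; i < num_words; ++i)
        result += Counter::count(words_[i]);
      return result;
    }

  private:
    static const std::size_t bits_per_word
      = CHAR_BIT * sizeof(unsigned long);
    static const std::size_t num_words
      = (N + bits_per_word - 1) / bits_per_word;
    unsigned long words_[num_words];
  };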
 

Genny.

