From: Gennaro Prota (gennaro_prota_at_[hidden])
Date: 2002-07-30 19:29:05
Hi everybody,
thanks to Jeremy, who patiently pointed me to the right location, I've
finally downloaded the new dynamic_bitset files :-)
Well, first of all, my apologies if some of the remarks show that I
have not followed the previous discussions about the library, but I
have been quite busy over the last few days.
My attention has mainly focused on performance. Again, if it's
too early to consider that, please let me know.
IMHO many functions of the library can be effectively optimized (for
instance, I suspect that the shift operators <<, >>, >>= and <<= can,
though I haven't tried yet), and they can be optimized in a portable
way.
Since I was not sure whether others are interested in speeding up the
library now, I limited myself to one member function: count(). The
attached files contain two alternative implementations that use a
lookup table and turn out to be almost 6 times faster than the current
one. For instance, here's an excerpt of the timings on my machine:
Timings for dynamic_bitset<unsigned int> [1000000 iterations]
--------------------------------------------------
Current implementation: 19.688 s
With 256-element table: 2.954 s
With 16-element table: 3.335 s
To avoid entangling the original code, the new stuff is in two
separate files: dynamic_bitset_table.hpp and table_proposal.impl. The
changes to dynamic_bitset.hpp have been limited to the declaration of
the new member functions count_tb and count_tbreduc and the #include
of dynamic_bitset_table.hpp.
Ok, I hear what you are all saying... how do you choose the size of
the table? :-) Yes, that's true. The provided solution is not fully
portable, because some C++ implementations may require a larger table
(depending on the value of CHAR_BIT) or a slightly different coding of
the count function (in order to break up a CHAR_BIT-sized byte into more
than two parts - see the code for details), but the gain in speed is
remarkable, and the majority (all?) of the C++ implementations currently
supported by boost have CHAR_BIT == 8. So I think it's worth
considering anyway, at least for some platforms.
Two words on the attached code: as I said, the idea (used by STLPort
too) is to use a table in which the element at index n is the number
of 1 bits in the binary representation of n. The easiest solution is
to use a table of 1+UCHAR_MAX elements, but, as the tests show, it
can easily be reduced to 1 << ((CHAR_BIT+1)/2) elements without
significant loss of performance, at least on an Intel box. For
instance, the count_tbreduc function actually uses only 16 elements of
the table.
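To make the idea concrete without opening the attachments, here is a
minimal, self-contained sketch of the reduced-table technique for the
common CHAR_BIT == 8 case (the names nibble_count and count_bits are
invented for the example; they are not the ones used in the attached
files):

#include <cstddef>

// 16-entry table: nibble_count[n] is the number of 1 bits in n (0 <= n < 16)
const unsigned char nibble_count[16] =
{
    0, 1, 1, 2, 1, 2, 2, 3,
    1, 2, 2, 3, 2, 3, 3, 4
};

// count the set bits of a buffer of n bytes, looking up each byte
// as two 4-bit halves (assumes CHAR_BIT == 8)
std::size_t count_bits(const unsigned char* p, std::size_t n)
{
    std::size_t num = 0;
    for (std::size_t i = 0; i < n; ++i)
        num += nibble_count[p[i] & 0x0F]    // lowest 4 bits
             + nibble_count[p[i] >> 4];     // highest 4 bits
    return num;
}

The 256-element version is the same loop with a single table[p[i]]
lookup per byte, trading a larger table for one lookup less.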
On an implementation where CHAR_BIT == 9, 32 elements of the table
would be used. And the count of each byte would be computed as the
count of the lowest 5 bits + the count of the highest 4 bits.
If CHAR_BIT is, let's say, 32, then this "subdivision into two parts" of
each byte would require a table of 2^16 = 65536 elements; thus a
solution could be to modify the code of the count_tbreduc function
num +=
    (
        detail::count<>::table [*p & lowmask]              // lowest (CHAR_BIT+1)/2 bits
        + detail::count<>::table [*p >> ((CHAR_BIT+1)/2)]  // remaining high CHAR_BIT/2 bits
    );
to use, for instance, four addends.
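Just to illustrate, a "four addends" version could look like the
following fragment, in the same style as the excerpt above (here table8
is a hypothetical 256-entry table with table8[n] equal to the number of
1 bits in n, and CHAR_BIT is assumed to be 32):

num +=
    (
        table8[ *p        & 0xFF]   // bits 0..7
      + table8[(*p >> 8)  & 0xFF]   // bits 8..15
      + table8[(*p >> 16) & 0xFF]   // bits 16..23
      + table8[(*p >> 24) & 0xFF]   // bits 24..31
    );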
Now, what's the situation? All the platforms currently supported by
boost would greatly benefit from this technique. Do we want to give it
up for the sake of portability?
If not, I think at least two options are possible:
a) providing an implementation with a 16- or 32-slot table, then using
conditional compilation to choose between that table implementation
and the portable one. If the table-size/count()-loop combination is
not suitable for a given C++ platform, boost automatically switches to
the portable 1-bit-at-a-time code (see the sketch after this list).
b) as above, but with the possibility for the user to provide his own
bit-count solution for a single byte, for instance changing the loop
above into:
num += user_defined_count (*p);
Again, suitable macros could be used to control whether to switch to
the slow solution or to ask the user to provide the
user_defined_count() function.
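To show in code what I mean by a) and b), here is a rough sketch of the
kind of configuration layer I have in mind; the macro names
(BOOST_DYN_BITSET_NO_COUNT_TABLE, BOOST_DYN_BITSET_USER_COUNT) and the
count_byte helper are invented for this example, they are not existing
boost names:

#include <cstddef>
#include <climits>

#if CHAR_BIT == 8 && !defined(BOOST_DYN_BITSET_NO_COUNT_TABLE)

    // fast path: 16-entry table, two lookups per byte
    inline std::size_t count_byte(unsigned char b)
    {
        static const unsigned char table[16] =
            { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4 };
        return table[b & 0x0F] + table[b >> 4];
    }

#elif defined(BOOST_DYN_BITSET_USER_COUNT)

    // option b): the user supplies his own per-byte count
    std::size_t user_defined_count(unsigned char b);
    inline std::size_t count_byte(unsigned char b)
    {
        return user_defined_count(b);
    }

#else

    // portable fallback: 1 bit at a time
    inline std::size_t count_byte(unsigned char b)
    {
        std::size_t n = 0;
        for ( ; b != 0; b = static_cast<unsigned char>(b >> 1))
            n += b & 1u;
        return n;
    }

#endif

The count() loop then simply does num += count_byte(byte) for every
byte of every block, whichever branch is selected.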
So much for count(). As I said, it's likely that other functions, e.g.
the shift operators, could be optimized. If there are no objections, I
can check whether a two-pass shift (first shifting whole blocks, i.e.
pos / bits_per_block with bits_per_block = CHAR_BIT * sizeof(Block),
then the remaining pos % bits_per_block bits) is faster. Anyone interested?
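For reference, here is a minimal sketch of the two-pass idea applied to
a bare std::vector of blocks, with block 0 holding the least
significant bits; it only illustrates the blocks-then-bits
decomposition of a left shift and is not the dynamic_bitset code:

#include <cstddef>
#include <climits>
#include <vector>

template <typename Block>
void shift_left(std::vector<Block>& v, std::size_t pos)
{
    const std::size_t bits_per_block = CHAR_BIT * sizeof(Block);
    const std::size_t n = v.size();
    const std::size_t block_shift = pos / bits_per_block;  // pass 1: whole blocks
    const std::size_t bit_shift   = pos % bits_per_block;  // pass 2: remaining bits

    if (block_shift >= n) {                 // everything shifted out
        v.assign(n, Block(0));
        return;
    }

    // work from the most significant block downwards, so that source
    // blocks are read before they are overwritten
    if (bit_shift == 0) {
        for (std::size_t i = n; i-- > block_shift; )
            v[i] = v[i - block_shift];
    } else {
        for (std::size_t i = n; i-- > block_shift; ) {
            Block high = Block(v[i - block_shift] << bit_shift);
            Block low  = (i - block_shift > 0)
                       ? Block(v[i - block_shift - 1] >> (bits_per_block - bit_shift))
                       : Block(0);
            v[i] = Block(high | low);
        }
    }

    // zero the vacated low-order blocks
    for (std::size_t i = 0; i < block_shift; ++i)
        v[i] = Block(0);
}

The symmetric right shift walks from block 0 upwards instead. Whether
this is actually faster than the current code is exactly what I'd like
to measure.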
Now, some questions :-)
1. The code, it seems, does not compile on VC++. What's the intent?
Should it in the future?
2. The constructor
dynamic_bitset(size_type num_bits, unsigned long value, const
Allocator& alloc)
could be awkward to use for some people (and slightly error prone),
because it requires remembering the order of the num_bits and value
arguments. Were alternatives considered?
3. Many member functions use the literals '1' and '0' where the type
of the character is charT. Is this really generic? (Not that I know
how to do otherwise, but I'm wondering...)
Genny.