
From: Daniel James (daniel_james_at_[hidden])
Date: 2008-04-19 06:26:55


On 19/04/2008, Thorsten Ottosen <thorsten.ottosen_at_[hidden]> wrote:
> Hi Jeremy,

Jeremy wrote the original version; I'm responsible for the library nowadays.

> I tried expanding your list of primes so that it contains more numbers
> in the small range and in particular such that it contains numbers
> closely above powers of 2. It happens frequently that powers of 2 come up.

The theory is that you get fewer collisions when the size is as far
from a power of 2 as possible. But I haven't looked into the research
that this is based on, so I'm not sure how trustworthy it is.
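
For example, with a power-of-two bucket count, any common factor in the
hash values wipes out a proportional fraction of the buckets, because
the bucket index is just hash % bucket_count. A quick illustration
(only the modulo arithmetic, not Boost.Unordered's code; the stride of
8 is an arbitrary stand-in for aligned pointers or sizes):

#include <cstddef>
#include <iostream>
#include <set>

int main() {
    const std::size_t keys = 1000;
    const std::size_t stride = 8; // stand-in for 8-byte-aligned values
    const std::size_t bucket_options[] = { 64, 67 }; // power of 2 vs. nearby prime
    for (std::size_t b = 0; b < 2; ++b) {
        std::set<std::size_t> used;
        for (std::size_t i = 0; i < keys; ++i)
            used.insert((i * stride) % bucket_options[b]);
        std::cout << bucket_options[b] << " buckets: "
                  << used.size() << " actually used\n";
    }
}

With 64 buckets only 8 of them are ever used; with 67, all 67 are.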

> Here's my new list:
>
> static const std::size_t prime_list[] = {
> 5ul, 11ul, 17ul, 29ul, 37ul, 53ul, 67ul, 79ul,
> 97ul, 131ul, 193ul, 257ul, 389ul, 521ul, 769ul,
> 1031ul, 1543ul, 2053ul, 3079ul, 6151ul, 12289ul, 24593ul,
> 49157ul, 98317ul, 196613ul, 393241ul, 786433ul,
> 1572869ul, 3145739ul, 6291469ul, 12582917ul, 25165843ul,
> 50331653ul, 100663319ul, 201326611ul, 402653189ul, 805306457ul,
> 1610612741ul, 3221225473ul, 4294967291ul };

It was requested that I use smaller increments during the review. I
haven't looked at it yet, but I created a ticket and I should get
round to it before 1.36:

http://svn.boost.org/trac/boost/ticket/1710

> For the test I conducted, not once did I get a max load factor above 1
> for containers of size 4, 5, 7, 9 etc up to about 512.

It's written so that it rehashes whenever an insert would cause the
maximum load factor to be exceeded, so you never will see one. The main
performance worry is that it will rehash too often, so it might be
necessary to either let it exceed the maximum load factor a little, or
ensure that the bucket count always grows by at least a certain
percentage.
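
To make that concrete, here's a minimal sketch of the check, assuming a
growth policy that at least doubles the bucket count and rounds up to a
prime (the real container's bookkeeping and growth policy may differ):

#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical helper: smallest prime >= n, by trial division.
std::size_t next_prime(std::size_t n) {
    for (;; ++n) {
        bool prime = n > 1;
        for (std::size_t d = 2; d * d <= n; ++d)
            if (n % d == 0) { prime = false; break; }
        if (prime) return n;
    }
}

// Bucket count to use once one more element has been inserted.
std::size_t buckets_after_insert(std::size_t size, std::size_t bucket_count,
                                 float max_load_factor) {
    // If one more element would push the load factor past the maximum,
    // grow the bucket array before inserting.
    if (static_cast<float>(size + 1) > max_load_factor * bucket_count) {
        std::size_t wanted = static_cast<std::size_t>(
            std::ceil((size + 1) / max_load_factor));
        // Growing by at least a fixed factor limits how often we rehash.
        return next_prime(std::max(wanted, 2 * bucket_count));
    }
    return bucket_count; // no rehash needed
}

The std::max call is the "grows by at least a certain percentage" part;
without it, each rehash would grow to the bare minimum and the container
would rehash again almost immediately, which is the worry above.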

That said, for larger sizes you'll generally find that the memory used
by the nodes (the element size, one or two pointers, and the memory
management overhead for each node) is much larger than the size of the
bucket array (one pointer per bucket, plus the overhead for a single
allocation).
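
As a back-of-the-envelope example (the 16-byte allocation overhead per
node is an assumption; the real overhead depends on the allocator and
platform):

#include <cstddef>
#include <iostream>

int main() {
    const std::size_t elements = 100000;
    const std::size_t buckets = 98317;         // prime near the element count
    const std::size_t per_node = sizeof(int)   // stored value
                               + sizeof(void*) // "next" pointer in each node
                               + 16;           // assumed allocation overhead
    std::cout << "nodes:        ~" << elements * per_node     << " bytes\n"
              << "bucket array: ~" << buckets * sizeof(void*) << " bytes\n";
}

On a 64-bit machine that's roughly 2.8 MB of node memory against well
under 1 MB for the bucket array.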

> Another alternative would be to allow users to control the initial
> bucket count explicitly in the constructor. I think I might like this
> better. Something like
>
> boost::unordered_set<T*> the_set( boost::unordered::bucket_count(32u) );
> assert( the_set.bucket_count() == 32u );

The problem is that performance will typically degrade when the bucket
count isn't a prime number, although if you have a high-quality hash
function this shouldn't be the case.

If you did this, would you want the bucket count to remain fixed,
regardless of how large the container gets?

An alternative that has some appeal is letting the user specify a
policy class to control the bucket count, although I'm not sure how
that would play with the standard.
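
In case it helps the discussion, here's a rough idea of what such a
policy might look like. This isn't an interface the library offers,
just a hypothetical sketch:

#include <cstddef>

// A policy decides the starting bucket count and how it grows on rehash.
struct power_of_two_policy {
    static std::size_t initial_bucket_count(std::size_t hint) {
        std::size_t n = 8;
        while (n < hint) n <<= 1;
        return n;
    }
    static std::size_t next_bucket_count(std::size_t current) {
        return current << 1; // double on each rehash
    }
};

// The container would take the policy as an extra template parameter,
// defaulting to the current prime-based behaviour, e.g.:
//
//   template <class T, class Hash, class Pred, class Alloc,
//             class BucketPolicy = prime_policy>
//   class unordered_set;

The obvious friction is that the extra template parameter changes the
container's signature relative to the standard's unordered_set.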

Daniel

