From: Peter Dimov (pdimov_at_[hidden])
Date: 2005-03-19 07:55:28
Daniel James wrote:
> Peter Dimov wrote:
>
>> This is not an assumption. It is the specification of hash_value that
>> determines how it is to be used. Currently it does not offer any
>> guarantees for the entropy of any particular bits, so the client of
>> hash_value is expected to not do something stupid, like throw away
>> 28 of its 32 bits.
>
> From google's sparse hash table docs:
>
> "(As a side note, using a table size that's a power of two has several
> advantages, including the speed of calculating (i % table_size). On the
> other hand, power-of-two tables are not very forgiving of a poor hash
> function. Make sure your hash function is a good one! There are plenty
> of dos and don'ts on the web (and in Knuth), for writing hash
> functions.)"
It seems intuitively obvious to me that we should stick with the original
requirements on hash_value, that is, "equivalent inputs produce equal
outputs, distinct inputs likely produce distinct outputs". That's because
the hash table components are outnumbered by the hash_value overloads by
several orders of magnitude, so it makes sense to shift the responsibility
for producing a good distribution away from hash_value and onto the table
components.
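To make the division of labour concrete, here is a minimal sketch of an
overload written only to that contract; the point type is made up for
illustration, and boost::hash_combine is used just to fold the members
together. Nothing here promises entropy in any particular bits; that part
is left to the container.

    #include <cstddef>
    #include <boost/functional/hash.hpp>   // boost::hash_combine

    struct point
    {
        int x;
        int y;
    };

    inline bool operator==(point const & a, point const & b)
    {
        return a.x == b.x && a.y == b.y;
    }

    // Equivalent points produce equal hashes, distinct points are likely
    // to differ; no guarantee is made about individual bits of the result.
    inline std::size_t hash_value(point const & p)
    {
        std::size_t seed = 0;
        boost::hash_combine(seed, p.x);
        boost::hash_combine(seed, p.y);
        return seed;
    }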
As for power-of-two table sizes being a source of speedup, this is
obviously only true when hv % 2^n doesn't produce collisions. If the key
set doesn't already have this property, the cost of fixing it (in the
general case, where it is not known which bits vary, something like
x ^ (x % p1 + x * p2) needs to be used) may cancel out the speedup obtained
by using a 2^n table size.
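To spell that out, a rough sketch; the concrete values of p1 and p2 below
are placeholders only, not a recommendation:

    #include <cstddef>

    // With table_size == 2^n, the reduction keeps only the low n bits.
    inline std::size_t bucket_pow2(std::size_t hv, std::size_t table_size)
    {
        return hv & (table_size - 1);   // same as hv % table_size for 2^n sizes
    }

    // The general repair mixes the high bits back in before reducing.
    inline std::size_t mix(std::size_t x)
    {
        std::size_t const p1 = 2147483647u;   // placeholder prime
        std::size_t const p2 = 2654435769u;   // placeholder odd multiplier
        return x ^ (x % p1 + x * p2);
    }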
On the other hand, it doesn't seem to cost us anything to apply x + (x >> 3)
to our pointer hash function, so we might as well do it.
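In code, that cheap mix looks something like the sketch below (the cast of
the pointer to an integer is the usual assumption for pointer hashing);
since heap pointers are normally aligned, the low bits are nearly constant,
and the shift folds in bits that actually vary:

    #include <cstddef>

    template <class T>
    inline std::size_t hash_pointer(T * p)
    {
        std::size_t x = reinterpret_cast<std::size_t>(p);
        return x + (x >> 3);   // fold varying higher bits into the constant low bits
    }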