Subject: Re: [boost] [hash] regular behaviour of hash function for double values
From: Howard Hinnant (howard.hinnant_at_[hidden])
Date: 2012-02-01 12:52:44
On Feb 1, 2012, at 12:15 PM, Andrey Semashev wrote:
> On Wednesday, February 01, 2012 10:57:18 Howard Hinnant wrote:
>> As a judgement call, I calculated that I would have *many* more clients who
>> are casual users of the std::unordered containers, defaulting most options
>> such as the hash function, than I would have clients interested in using
>> std::hash<scalar> in their own containers, or in other places a hash
>> function might be useful.
> Thank you for your careful answer. As I understand it, to compensate for the
> simplicity of std::hash you do additional hashing (bit mixing or a modulus) in
> the container. May I ask, have you considered the other approach - making the
> container simpler but adding bit mixing to std::hash? Why did you choose not
> to follow this path?
Yes, I've considered that path. The libc++ containers are constrained to hold only a prime number of buckets, and every hash is reduced modulo the bucket count to produce a bucket index. The combination of a modulus with a prime number of buckets has a long and successful history for hash containers, dating back longer than I've been in this industry.
Another approach I've read about is to use a power-of-2 number of buckets and an & operation to bring the hash into the bucket range. Had I chosen this path, I would probably also have chosen stronger bit mixing for all default hash functions; otherwise, poor collision performance is all but assured.
One of the reasons I prefer the prime solution to the power-of-2 solution is that I believe it provides better protection against user-written hash functions that happen to be poorly chosen. I have decided to opt against "well, you wrote a poor hash function, you get what you deserve." Had this protection been unreasonably expensive, I would not have chosen this path. However, in my estimation the cost is very, very low: a modulus is more expensive than an &, but not greatly so on modern hardware.
Expensive things to do on older hardware include instructions like 64 bit modulus and floating point divide. When I wrote Fortran 30 years ago you could literally predict the performance of numerical code by just counting the floating point divides. You could completely ignore the additions and multiplications because they were practically free by comparison.
On modern hardware expensive things to do include flushing the L1 cache and allocating memory. By comparison, individual instructions (except for memory barriers) are practically free.
The power-of-2 solution is often associated with "incremental rehashing". As I recall, this has the advantage of spreading a rehash out over many insertions so that no single insertion is very expensive. However, the API of the std::unordered containers is very "vector-like": you can predict when rehashes are coming by inspecting things like the number of buckets, the load factor and the maximum load factor. You can even plan when your rehashes are going to happen by pre-allocating a predicted number of needed buckets and/or manipulating max_load_factor, and then calling rehash explicitly at convenient times.
So because of the ability to fine-tune when and how rehashes happen, the chief selling point of incremental rehashing (and, by extension, of a power-of-2 number of buckets) is somewhat muted, imho.
I am not always right. Indeed I pride myself on continually learning new things. Another way of pronouncing "continually learning" is: I was wrong, I learned something new, and now I think I'm right. So I could be wrong here too, and am about to learn something. I accept that. However I've been studying and shipping hash containers for about a dozen years now, and my current position is based on my customer feedback over that period of time.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk