Boost logo

Boost :

Subject: Re: [boost] [hash] regular behaviour of hash function for double values
From: Topher Cooper (topher_at_[hidden])
Date: 2012-02-01 15:15:59


On 2/1/2012 3:59 AM, Daniel James wrote:
>> > and values that differ by a small amount end up in
>> > nearby bins -- which can cause excess collisions for data with some
>> > corelational structure.
>>
> The STL hash containers are pretty much required to be implemented
> using a chained hash table.
>
>
I've seen that claim, and I might well be missing something, but I
believe that coalesced hashing would also work. Deletion (erasure)
algorithms that retain an expectation of a few probes up to very high
load factors are complicated and somewhat costly though (as I remember
-- its been a long time since I worked with this) asymptotically within
bounds. I don't know of any shortcuts in resizing beyond rehashing all
the entries, but the requirements are consistent with that, and resizing
a growing table needs to be done less often for chained hashing due to
good performance up to much higher load factors than external chaining.
The cost is a less trivial implementation (but that is what libraries
are for), and constant higher expected cost of element erasure. The
gain is frequent much better memory usage.

In any case, how collisions are handled is irrelevant to my point. If
it wasn't clear, the structure that I was describing as possibly
"corelational" is structure in the data to be hashed rather than the
structure of the container.

Let's take an example. Let's say that I'm mapping usernames to some
kind of object, and that usernames consist of a last name plus a number
assigned in order when there is a duplicate -- a common practice. The
presence of a last name increases the probability that a last name is
common, so the presence of "cooper3" makes it more likely that there
will be a "cooper4". If the hashing scheme hashes those "adjacent"
usernames to adjacent hash-values/bins then if "james23" happens to
collide with "cooper3" then "james24" will also collide with "cooper4".
Consequently, under these conditions, moderate load factors will result
in somewhat worse than the constant performance expected from truly
uniform hashes.

For what it is worth, FNV1 hashing suffers from precisely this
correlational sensitivity under these circumstances (although having the
next last byte being likely to end up in either the next /or/ the
previous bin but will do so consistently and this may result in the
third in the sequence being more likely to collide with the first),
which is why the so called "slightly better" distribution properties of
FNV1a (which corrects this defect) are not as trivial as the developers
imply.

Topher


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk