Subject: Re: [boost] [hash] regular behaviour of hash function for double values
From: Topher Cooper (topher_at_[hidden])
Date: 2012-01-31 19:44:57
On 1/31/2012 1:55 PM, Daniel James wrote:
> The requirements for the hash function are:
>
> For two different values t1 and t2, the probability that h(t1)
> and h(t2) compare equal should be very small, approaching
> 1.0 / numeric_limits<size_t>::max().
>
> There's no requirement about how h(t1) and h(t2) are distributed so it
> must be the container's responsibility, since the container must work
> well with any hash function that meets these requirements. How else
> can you interpret it?
>
As what it says.
It does not say anything about the distribution from which the two
different values are drawn, so the implication is that for /any/ two
values the probability must approach 1/max-value. Technically, of
course, this is impossible -- you can always choose the values from a
distribution based on inverting the hash (that this might be extremely
costly or even infeasible is irrelevant). A reasonable view, however,
is that if the values are chosen using any reasonable method that is not
too specific to the particular hash method then the probability of a
collision is as stated. Any simple pattern in the input that results in
a pattern in the output -- a restriction over the possible range of
output values for meaningful subsets of all possible values -- will
produce collision probabilities much higher than that.
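To make the kind of pattern I mean concrete (my own illustration, not
necessarily the exact example from earlier in the thread): the
representations of small integer-valued doubles agree in all of their
low mantissa bits, so a hash that just passes those bits through is
confined to a tiny subset of size_t for a very ordinary family of
inputs.

    // Illustration only: print the bit patterns of 1.0 .. 8.0.
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main()
    {
        for (double d = 1.0; d <= 8.0; d += 1.0) {
            std::uint64_t bits;
            std::memcpy(&bits, &d, sizeof bits); // read the representation
            std::printf("%4.1f -> %016llx\n", d,
                        static_cast<unsigned long long>(bits));
        }
        // 1.0 -> 3ff0000000000000, 2.0 -> 4000000000000000, ...
        // every one of these values ends in at least 48 zero bits.
    }
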
If we do not read it that way -- if we take "two different values" to
mean "two values uniformly selected from all possible input values" --
then the requirement is trivial. For any kind of value that is
represented in log2(numeric_limits<size_t>::max()) bits or fewer, just
use the identity function; for anything larger, just use the first
size_t's worth of bits and throw out the rest. That is all it takes to
conform to the standard by that reading.
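For instance (a sketch of my own, not anything Boost actually ships), a
hash that conforms under that reading could be as little as:

    // Identity for values no wider than size_t; plain truncation for
    // anything wider.  Trivially meets the wording under the
    // "uniformly selected inputs" reading, but is a poor general hash.
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    std::size_t trivial_hash(std::uint32_t v) { return v; }   // identity

    std::size_t trivial_hash(const void* p, std::size_t n)    // truncate
    {
        std::size_t h = 0;
        std::memcpy(&h, p, n < sizeof h ? n : sizeof h);
        return h;
    }
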
But even if this were not the case, and this hash function can be taken
to conform to the standard --
There is no explicit requirement for quality anywhere in the standards,
to the best of my knowledge. For example, a linear asymptotic
performance requirement does not /specify/ that there isn't a
two-minute overhead, independent of N, for each operation -- avoiding
that in an implementation is about /quality/ rather than /conformance/.
You can consider it to be an accurate implementation of the standard,
but you can't consider it a good one.
There are specialized applications where a poorly distributed hash (at
least poorly distributed in just the right way) is not only acceptable
but required. Sometimes you want to ignore some specific differences so
that values that differ only by those characteristics still hash to the
same value (the obvious example -- case-insensitive string hashes). When
hashing is used to locate similar objects, the hash function must
transform values that differ in small ways -- that are similar -- into
hash values close to each other. The classic Gorin spelling-correction
algorithm hashed each word from a document by its length and first two
letters, allowing any word with a single-letter insertion, deletion or
replacement, or a two-letter transposition, to be easily found; and
there are lots of geometric algorithms that reduce dimensionality or
range in a desirable way so that "close" points are still close, or at
least likely to be.
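A rough sketch of the Gorin-style key (my own illustration, not the
original algorithm's code):

    // Hash a word by its length and first two letters.  Deliberately
    // *not* well distributed: candidate corrections (single insertion,
    // deletion, replacement, adjacent transposition) fall into a
    // handful of predictable buckets.
    #include <cstddef>
    #include <string>

    std::size_t gorin_key(const std::string& word)
    {
        unsigned char c0 = word.size() > 0 ? word[0] : 0;
        unsigned char c1 = word.size() > 1 ? word[1] : 0;
        return (word.size() << 16) ^ (std::size_t(c0) << 8) ^ c1;
    }
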
But there are a number of characteristics that are routinely expected
of a /general purpose/ hash function of at least moderate quality, and
the avoidance of this kind of regularity is one of them.
The suggestion that the output could be fixed by adding a "bit mixing
function" is not really relevant, and neither is the statement that,
because the standard does not specify a good quality hash, it is the
container's responsibility to "fix it". The only way for the container
to do that is to rehash the hash value -- which is essentially what the
bit-mixing function does. (Such a function would be a not-terrible hash
function for fixed-width values for many purposes -- as long as it is
not a problem that the input and output have the same number of 1-bits.
Note, however, that the specific example that demonstrated the problem
produced lots of bits the same, which would still reduce the number of
possible values for structured inputs however much the bits were
shuffled around.)
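For concreteness, a mixing step of that popcount-preserving kind -- a
pure permutation of the bits, my own illustration rather than anything
actually proposed here -- might look like this, assuming a 64-bit word:

    // A bit reversal built from masked swaps: a pure permutation of the
    // 64 bits, so the number of 1-bits never changes.  Inputs that agree
    // on most bit positions still agree on most (shuffled) positions, so
    // the restricted output range noted above is not repaired.
    #include <cstdint>

    std::uint64_t shuffle_bits(std::uint64_t x)
    {
        x = ((x & 0x5555555555555555ULL) << 1)  | ((x >> 1)  & 0x5555555555555555ULL);
        x = ((x & 0x3333333333333333ULL) << 2)  | ((x >> 2)  & 0x3333333333333333ULL);
        x = ((x & 0x0F0F0F0F0F0F0F0FULL) << 4)  | ((x >> 4)  & 0x0F0F0F0F0F0F0F0FULL);
        x = ((x & 0x00FF00FF00FF00FFULL) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFULL);
        x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
        return (x << 32) | (x >> 32);
    }
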
To specify that a hash-based algorithm should be written to be proof
against poorly distributed inputs essentially means that the expectation
is that the input hash value should be rehashed. Of course, it is
standard for hash tables to have prime-like lengths (generally taken to
mean avoiding lengths with small divisors -- especially powers of 2),
but that has to do with the quality of the necessary range reduction
(a.k.a. value folding). This is generally considered a good enough
precaution to allow "raw integer values" to be used directly for many
applications, but for less casual use it is not acceptable (values that
differ by the table size end up in the same bin, and values that differ
by a small amount end up in nearby bins -- which can cause excess
collisions for data with some correlational structure).
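A concrete illustration of that range reduction (my own sketch, with an
arbitrary prime-ish table length):

    // bucket = hash % bucket_count.  With a raw integer used as its own
    // hash, values that differ by exactly bucket_count share a bin, and
    // values that differ by 1 land in adjacent bins -- fine for casual
    // use, a source of excess collisions for correlated data.
    #include <cstddef>
    #include <cstdio>

    int main()
    {
        const std::size_t bucket_count = 1009;
        std::size_t a = 12345;
        std::printf("%zu %zu\n", a % bucket_count,
                    (a + bucket_count) % bucket_count);     // same bucket
        std::printf("%zu %zu\n", a % bucket_count,
                    (a + 1) % bucket_count);                // neighbouring buckets
    }
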
Let's turn one of the arguments made on its head. Why should someone who
finds it convenient to use a reasonable-quality hash on their data type
pay the overhead of using a container that compensates for the
possibility of a poor one? Is it a more reasonable expectation that
someone who wants a different balance of quality vs. performance than
the default should rewrite the standard hash, or rewrite the standard
unordered_map?
Sorry for the length -- I had a lot to say, I guess.
Topher Cooper