From: Thomas Wenisch (twenisch_at_[hidden])
Date: 2002-10-07 13:55:15
On Mon, 7 Oct 2002, Craig Henderson wrote:
>
>
> CRC does not guarantee 100% uniqueness. My implementation is used in a
> scenario where I know that the strings must be unique, and so can get away
> with this. In a generic solution, the case would need to be handled, perhaps
> by comparing the lengths of the strings and, if they are equal, comparing
> the contents with a standard memcmp(). This will, of course, add overhead
> to the algorithm, but the obvious no-match cases will be handled very
> efficiently.
>
> Unfortunately, your hash algorithm is not 100% guaranteed to be unique
> either. However, it will (probably) cause fewer collisions than the CRC
> value because of the size of the hash. I use a 16-bit CRC despite storing
> it in a 32-bit value - this is just an oversight in my code. I suggest you
> take a look at the SHA-1 algorithm - this has not, AFAIK, ever been seen
> to generate a duplicate value, unlike MD5.
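
For reference, a minimal sketch of the hash-then-compare fallback described
above might look like the following; the CRC routine here is a generic
CRC-16/CCITT stand-in, not Craig's actual implementation, and strings_equal
is just an illustrative name:

    #include <cstring>
    #include <string>

    // Stand-in 16-bit CRC (CCITT polynomial, bitwise); any 16-bit CRC would
    // do here -- only the hash-then-compare idea matters.
    unsigned short crc16(const std::string& s)
    {
        unsigned short crc = 0xFFFF;
        for (unsigned char c : s) {
            crc ^= static_cast<unsigned short>(c) << 8;
            for (int i = 0; i < 8; ++i)
                crc = (crc & 0x8000)
                        ? static_cast<unsigned short>((crc << 1) ^ 0x1021)
                        : static_cast<unsigned short>(crc << 1);
        }
        return crc;
    }

    // Cheap rejects first; a full comparison only when everything else matches.
    bool strings_equal(const std::string& a, const std::string& b)
    {
        if (crc16(a) != crc16(b)) return false;   // different CRC  => definitely different
        if (a.size() != b.size()) return false;   // different size => definitely different
        return std::memcmp(a.data(), b.data(), a.size()) == 0;
    }
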
By a simple counting argument, it can be proven that any algorithm that maps
strings to hash codes with fewer bits than the strings themselves must
produce duplicates for some strings: there are more possible strings than
possible hash codes, so at least two strings have to share a code. While
SHA-1 may be less likely to produce duplicates in practice (i.e. it may map
strings to each of the possible hash codes with roughly equal probability),
comparing hash codes can never guarantee that two strings are the same, only
that they are different. No hash algorithm can. The only way to guarantee
the strings are the same is to actually compare the strings.
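
To make the counting argument concrete, here is a small sketch; it uses
std::hash truncated to 16 bits purely as a stand-in for any hash function
(SHA-1 included), and the 65537 numeric strings are an arbitrary choice:

    #include <cassert>
    #include <functional>
    #include <set>
    #include <string>

    int main()
    {
        // 65537 distinct strings, each reduced to a 16-bit code.  There are
        // only 65536 possible codes, so at least two strings must share one,
        // no matter which hash function produced the codes.
        std::set<unsigned short> codes;
        bool collision = false;
        for (int i = 0; i <= 65536 && !collision; ++i) {
            unsigned short code = static_cast<unsigned short>(
                std::hash<std::string>()(std::to_string(i)) & 0xFFFFu);
            collision = !codes.insert(code).second;   // insert fails on a repeat
        }
        assert(collision);   // guaranteed by the pigeonhole principle
        return 0;
    }
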
Regards,
-Tom Wenisch
Computer Architecture Lab