On Thu, Apr 7, 2011 at 7:56 PM, Scott McMurray <me22.ca+boost@gmail.com> wrote:
On Thu, Apr 7, 2011 at 10:08, Erik Scorelle <escorelle.work@gmail.com> wrote:
> We have been using boost hash to hash filenames, but have found with some of
> our user data that certain strings will produce the same hash code (
> "0012g6" and "0012fu" for example).  Is there a recommended way to predict
> or resolve these sorts of conflicts?
>

Having some collisions is the expected behaviour of a
(non-cryptographic) hash function.  With a birthday search you can
easily find thousands of examples.

I'm not sure what you mean by "resolve".  Normally, code using hashes
expects collisions, and uses a full equality operator to resolve them.

If you cannot accept any collisions, you could use perfect hashing
(libraries like <http://cmph.sourceforge.net/> or
<http://www.gnu.org/software/gperf/gperf.html>) or use a cryptographic
hash (I have a usable, but WIP library at
<http://svn.boost.org/svn/boost/sandbox/hash/>).

~ Scott

Scott,

AFAIK perfect hashing works only with the known set of keys, otherwise it is not possible.
Just to make a note: crypto-hashes are much slower as normal ones and use much more memory. Erik, how many file names are you going to hash? There was an article on Dr.Dobbs which compared unordered_map (hash table) implementation with std::map. And it turned out that until some certain amount of items it makes no sence to use a hash table, because a normal map is much faster.

Here is the link: http://drdobbs.com/cpp/198800559?pgno=1


With Kind Regards,
Ovanes