On Thu, Apr 7, 2011 at 7:08 PM, Erik Scorelle <escorelle.work@gmail.com> wrote:
Hi,

We have been using boost hash to hash filenames, but have found with some of our user data that certain strings will produce the same hash code ( "0012g6" and "0012fu" for example).  Is there a recommended way to predict or resolve these sorts of conflicts?

Thanks in advance!

Hi!
Hash functions do not guarantee uniqueness. The case when the key is not unique called collision. Usually hash tables not only compare the hash key, but also the actual data to be sure that the right record was referenced. The most primitive way to implement a hash-table is to put all records hashing to the same key into a linked list and search linear. Hashing is effecient, if the choosen hash function has a good distribution and results in not so many collisions, which in turn results in amortized constant time for the access.

Boost unordered_map is the implementation of a hash table: http://www.boost.org/doc/libs/1_46_0/doc/html/boost/unordered_map.html

I suggest reading this article from Wikipedia: http://en.wikipedia.org/wiki/Hash_function

And for further reading refer to Cormen et al: Introduction to algorithms. They have the whole chapter dedicated to hashing techniques.


With Kind Regards,
Ovanes