Boost logo

Boost :

From: Hartmut Kaiser (hartmutkaiser_at_[hidden])
Date: 2005-01-17 02:34:57


Hi all,

recently I've written a minimal perfect hash function generator, which
generates very efficient hashing functions for arbitrary (reasonable large)
key sets containing strictly positive integers to achieve a minimal and
perfect mapping of these keys to hash values in the range [0..N), where N is
the size of the key set (minimality requirement) and the hash values are
unique (perfection requirement).

IOW, given a set of N arbitrary positve integers (xi > 0) this hash function
generator allows to map the integers to another set of integers, the values
of which are in between and including 0 and N-1. No duplicate mappings are
generated.

Generally the developed MPH (minimal perfect hash) generates the
coefficients for the following hash functions:

    h(x) = (C / (D*x + E) mod N or
    h(x) = (C / x) mod N (D == 1, E == 0)

where the second one is choosen almost all the time (generally C, D and E
are integers).
Especially the second one requires only a couple of instructions to be
executed, which makes it a very effective selection alternative for table
oriented algorithms.

The requirement, that the input keys should be strictly positive isn't a big
limitation, because it is always possible to accordingly transform key sets
containing possibly negative numbers, or to derive the keys from strings to
match.

Sounds really perfect! But there are some drawbacks:

- the keys (obviously) must be known in advance, so a corresponding hash
function may be generated.
- some initial computation is required to find the coefficients C, D and E
(which is obvious as well :-).

The initial overhead is tolerable, though, here are some timings on my
machine - Pentium M, 1.6GHz, 1GByte:

      20 keys: 0.005 [s], 200 keys: 0.03 [s], 2000 keys: 0.15 [s], 20000
keys: 4.45[s]

where the time is raising slightly more worse than linear (actually it has a
linear and an exponential component, but the exponential component kicks in
for larger key sets only), depending on the number of keys.
The memory requirement for the MPH generator is raising linearily with the
number of keys: roughly N/3 * sizeof(int), where N is the number of keys.

This hash function generator is usable for a wide range of applications,
where efficient data lookup is a major goal and where sparse data sets have
to be treated efficiently.

Is there anybody interested in such a thing?

Regards Hartmut


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk