Boost logo

Boost :

Subject: Re: [boost] [Tokenmap] A (perfect) hash container library chcked-in to sandbox (RFC)
From: Daniel Trebbien (dtrebbien_at_[hidden])
Date: 2010-02-28 11:06:01


> Once the key is generated, it's "perfect" and collisions in find or
> pop are not possible.

I spent some time trying to understand the secret of this property,
and came up with a description of the problem and solution which I
thought was helpful.

A `tokenmap` is essentially a dynamic array-backed hash map (from
tokens [32-bit unsigned integers, WLOG] to `mapped_type` objects) with
custom find, rehash, and insert members. Calling the array `store` and
its size `store_size`, the find member returns the `mapped_type`
object `store[key % store_size]`. The rehash member doubles the size
of `store`, and for each element in the old store, the element is
placed at `new_store[key % 2*old_store_size]`. Finally, the insert
member basically finds an unused element of `store` where the given
`mapped_type` object can be placed, and returns some token value (key)
such that the token mod `store_size` is the index where the
`mapped_type` object was placed.

I denote the set of integers by Z and the ring of integers mod s by
Z_s. Elements of Z_s are denoted with so-called "class notation":
[x]_s is the set of integers that have the same remainder as x when
divided by s; in other words, for x an integer, [x]_s is the set of
all integers that are congruent mod s.

The size of `store` begins with s_0, the initial size, and proceeds to
s_1 after the first rehash. Generally, s_k is the size of the store
after k rehashes.

In order for the token-generating strategy to be "perfect", then the
key property of generated tokens x and y is:

[x]_{s_0} != [y]_{s_0} -> [x]_{s_k} != [y]_{s_k}

(If x and y mod s_0 have different indices in the store of length s_0,
then they must have different indices in the store of length s_k.)

and:

[x]_{s_k} != [y]_{s_k} -> [x]_{s_{k + 1}} != [y]_{s_{k + 1}}

Some choices of the sequence {s_k} will not work. For example, suppose
s_0 was 2 and s_1 was 3. Then, tokens 0x25b786ef and 0x0010b86c can be
generated while the store size is 2 (because mod 2, they are 1 and 0,
respectively), but once the store size becomes 3, then they hash to
the same location (they are both 2 mod 3).

In `tokenmap`, the sequence of sizes is 2^k*s_0. Generally, because
the size of the store is always a multiple of the initial size, then
the "perfect" property is maintained; I claim that [x]_s != [y]_s ->
[x]_{c*s} != [y]_{c*s} for any natural number c.

In order to prove the claim, I consider the logically-equivalent
contrapositive: [x]_{c*s} == [y]_{c*s} -> [x]_s == [y]_s:
[x]_{c*s} == [y]_{c*s} -> c*s | (x - y)
c*s | (x - y) -> s | (x - y) -> [x]_s == [y]_s qed.

So, in conclusion:
 1. In order to maintain the "perfect" property, the size of the store
only needs to be a multiple of the initial size.

 2. Special support from the container is needed so that the size of
the store is always a multiple of the initial size and rehashing is
performed correctly.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk