
Subject: [boost] Idea for O(1) hash-table
From: Rani Sharoni (ranisharoni75_at_[hidden])
Date: 2013-03-16 04:17:16


Hi All,

I hope that you will find the following idea interesting.

I have been using a hash table with, AFAICT, a unique implementation, and I
thought that it might be handy for Boost.
The hash table gives O(1) *non-amortized* lookup via extremely balanced
buckets (for example, at most 6 entries per bucket).

The implementation itself is absurdly simple ;-)
The hash table is basically a variation of open addressing with double
hashing, combined with a special re-insert strategy.

Algorithm for open addressing (easy to adjust for chained hashing):
Consider a two-dimensional N*K array: N buckets with up to K elements per
bucket.
Assume that a zero key marks an empty entry.

Lookup(key):
for (hash = key, i = 0; i < K; ++i) {
     hash = HASH(hash); // double hashing: re-hash the previous hash
     if (A[hash % N][i] == key) return found;
}

return not-found;

Insert(key):
for (hash = key, i = 0; i < K; ++i) {
     hash = HASH(hash); // double hashing: re-hash the previous hash
     if (A[hash % N][i] == 0) { A[hash % N][i] = key; return true; }
}

return false;
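
For concreteness, here is a minimal C++ sketch of the table and the two
routines above. It assumes 64-bit keys, key 0 reserved as the empty marker,
and uses the splitmix64 finalizer as a stand-in for HASH(); all names are
illustrative, not a definitive implementation.

#include <cstdint>
#include <vector>

// A[N][K] flattened into one array: N buckets of K slots, key 0 == empty.
struct Table {
    std::size_t N, K;
    std::vector<std::uint64_t> slots;                // size N * K
    Table(std::size_t n, std::size_t k) : N(n), K(k), slots(n * k, 0) {}
    std::uint64_t& at(std::size_t b, std::size_t i) { return slots[b * K + i]; }
};

// Stand-in for HASH(): the splitmix64 finalizer (any avalanching mixer works).
inline std::uint64_t mix(std::uint64_t x) {
    x += 0x9E3779B97F4A7C15ull;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ull;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBull;
    return x ^ (x >> 31);
}

// Lookup: probe one slot in each of up to K buckets chosen by double hashing.
bool lookup(Table& t, std::uint64_t key) {
    std::uint64_t h = key;
    for (std::size_t i = 0; i < t.K; ++i) {
        h = mix(h);
        if (t.at(h % t.N, i) == key) return true;
    }
    return false;
}

// Bounded insert: succeeds only if one of the key's K candidate slots is empty.
bool insert(Table& t, std::uint64_t key) {
    std::uint64_t h = key;
    for (std::size_t i = 0; i < t.K; ++i) {
        h = mix(h);
        std::uint64_t& s = t.at(h % t.N, i);
        if (s == 0) { s = key; return true; }
    }
    return false;
}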

The above insert is basically bounded double hashing, and as-is it gives low
utilization (a bit similar to a Bloom filter).
For example, with 100K buckets and K=6, inserts start failing when the table
is less than 50% utilized (i.e. they fail even though most entries are
empty).

The trick is to perform a “re-insert”:
Pick some random element, key1, from the slots probed during the failed
insert, replace it with the new key, and re-run Insert(key1).
Retrying up to, say, 12 rounds of re-insert improves the utilization (for
N=100K, K=6) to ~97%!
The table is resized (if needed) only when such repeated re-insert fails.
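
Building on the sketch above, a hedged sketch of this re-insert loop might
look as follows; the 12-round bound and the random victim choice follow the
description, while the RNG and helper names are illustrative.

#include <random>

// Re-insert: when insert() fails (all K candidate slots are occupied), evict a
// randomly chosen element from one of those slots, put the new key there, and
// retry with the evicted key, up to max_rounds times.
bool insert_with_reinsert(Table& t, std::uint64_t key, int max_rounds = 12) {
    static std::mt19937_64 rng{12345};               // any RNG/seed will do
    for (int round = 0; round < max_rounds; ++round) {
        if (insert(t, key)) return true;             // found an empty slot
        std::size_t pick = rng() % t.K;              // choose a victim slot
        std::uint64_t h = key;
        for (std::size_t i = 0; i <= pick; ++i) h = mix(h);  // hash at step 'pick'
        std::uint64_t& victim = t.at(h % t.N, pick);
        std::uint64_t evicted = victim;
        victim = key;                                // new key takes the slot
        key = evicted;                               // try to re-home the victim
    }
    return false;   // repeated re-insert failed: time to resize/rehash the table
}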

Altogether this implementation trades insert cost for fast lookup, which is
very appealing for caches (i.e. the insert happens on the already-expensive
path, only after a cache miss).

Just for comparison, a hash table using naïve linear probing
(A[HASH(key) % N][i]) reaches only ~80% utilization even with K=128 (i.e.
deep buckets)!
Such unbalanced behavior is also typical of classical hash tables with
linked lists (chained hashing), since statistically some buckets will
outgrow the average.
The performance of the classic double-hashing algorithm itself is quite
poor when the table is highly utilized, since many steps are required to
find empty entries.

Though this hash table is general purpose, it has some interesting
properties:
1) O(1) non-amortized lookup with a very low chance of collisions (each key
has its own unique “virtual” bucket).
2) Can be implemented in contiguous memory, without pointers and with
excellent utilization.
3) Allows a wait-free implementation of caches (i.e. no lock or even
interlocked operation is needed); see the sketch after this list.
4) Allows caches in shared memory: the algorithm is safe against
corruption/hazards (assuming at least 64-bit keys).
5) Allows persistence using direct memory-to-file mapping (i.e. the on-disk
representation is the same as the in-memory one).
6) Roughly 50%-LRU eviction for caches (i.e. on average the evicted entry
comes from the older half of the entries).
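
As a rough illustration of points 3) and 4), a cache variant might store the
keys as relaxed 64-bit atomics so that readers and writers need no locks and
no interlocked (read-modify-write) operations. This is only a sketch under
those assumptions: it reuses mix() from the earlier sketch, and for shared
memory the slot array would live in a mapped region rather than a
std::vector. Racing writers may overwrite each other, but for a cache the
worst outcome is a spurious miss.

#include <atomic>
#include <cstdint>
#include <vector>

// Fixed-size cache of 64-bit keys; slots are relaxed atomics, 0 == empty.
struct CacheTable {
    std::size_t N, K;                                // N buckets, K slots each
    std::vector<std::atomic<std::uint64_t>> slots;

    CacheTable(std::size_t n, std::size_t k) : N(n), K(k), slots(n * k) {
        for (auto& s : slots) s.store(0, std::memory_order_relaxed);
    }

    bool lookup(std::uint64_t key) const {
        std::uint64_t h = key;
        for (std::size_t i = 0; i < K; ++i) {
            h = mix(h);                              // mix() from the sketch above
            if (slots[(h % N) * K + i].load(std::memory_order_relaxed) == key)
                return true;
        }
        return false;
    }

    bool insert(std::uint64_t key) {
        std::uint64_t h = key;
        for (std::size_t i = 0; i < K; ++i) {
            h = mix(h);
            auto& s = slots[(h % N) * K + i];
            if (s.load(std::memory_order_relaxed) == 0) {
                s.store(key, std::memory_order_relaxed);   // plain store, no CAS
                return true;
            }
        }
        return false;   // a real cache would evict or re-insert here
    }
};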

BTW, there is little chance of patenting this algorithm since I already
published a similar idea in the public domain ;-)

Thanks,
Rani

