
Boost : 
Subject: [boost] Idea for O(1) hashtable
From: Rani Sharoni (ranisharoni75_at_[hidden])
Date: 2013-03-16 04:17:16
Hi All,
I hope that you will find the following idea interesting.
I have been using a hashtable with, AFAICT, a unique implementation, and I
thought that it might be handy for Boost.
The hash table offers O(1) *non-amortized* lookup via extremely balanced
buckets (for example, at most 6 entries per bucket).
The implementation itself is absurdly simple ;)
The hashtable is basically a variation of open addressing using double
hashing with a special reinsert strategy.
Algorithm for open addressing (easy to adjust for chained hashing):
Consider a two-dimensional N*K array – N buckets with up to K elements per
bucket.
Assume that the zero key marks an empty entry.
Lookup(key):
  for (hash = key, i = 0; i < K; ++i) {
    hash = HASH(hash); // double hashing
    if (A[hash % N][i] == key) return found;
  }
  return not_found;
Insert(key):
  for (hash = key, i = 0; i < K; ++i) {
    hash = HASH(hash); // double hashing
    if (A[hash % N][i] == 0) { A[hash % N][i] = key; return true; }
  }
  return false;
The above insert is basically bounded double hashing, and as-is it has
low utilization (a bit similar to a Bloom filter).
For example, with 100K buckets and K=6, insert will fail when the table has
less than 50% utilization (i.e. it fails to insert even though most entries
are empty).
The trick is to perform a “reinsert”:
Pick some random element, key1, from the failed insert phase, replace it with
the new key, and (re) Insert(key1).
Retrying up to, say, 12 rounds of reinsert improves the utilization, for
N=100K/K=6, to ~97%!
A table resize is done (if needed) when such repeated reinsertion fails.
Altogether, this implementation trades off insert cost for fast lookup, which
is very appealing for caches (i.e. the insert is done on the expensive path –
only after a cache miss).
Just for comparison, a hashtable using naïve linear probing
(A[HASH(key) % N][i]) has only ~80% utilization even when using K=128 (i.e.
deep buckets)!
Such unbalanced behavior is also typical of classical hashtables with
linked lists (chained hashing), since statistically some buckets will outgrow
the average.
The performance of the classic double-hashing algorithm itself is quite
poor when the table is highly utilized, since many steps are required to find
empty entries.
Though this hashtable is general purpose, it has some interesting
properties:
1) O(1) non-amortized lookup with a very low chance of collisions (each key
has its own unique “virtual” bucket).
2) Can be implemented on contiguous memory, without pointers and with
excellent utilization.
3) Allows a wait-free implementation of caches (i.e. no lock or even an
interlocked operation needed).
4) Allows caches in shared memory – a corruption-safe/hazard algorithm
(assuming at least 64-bit keys).
5) Allows persistence using direct memory-to-file mapping (i.e. the on-disk
representation is the same as the in-memory one).
6) Average 50% LRU eviction for caches (i.e. on average, an entry from the
older half is evicted).
BTW – there is little chance of patenting this algorithm since I already
published a similar idea in the public domain ;)
Thanks,
Rani
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk