Subject: Re: [boost] Idea for O(1) hash-table
From: Shakti Misra (shakti.misra.study_at_[hidden])
Date: 2013-03-16 08:35:33


Hi,
Maybe you can add this to boost.hash. It would be a good addition.
On Mar 16, 2013 2:42 PM, "Rani Sharoni" <ranisharoni75_at_[hidden]> wrote:

> Hi All,
>
> I hope that you will find the following idea interesting.
>
> I have been using a hash table with, AFAICT, a unique implementation, and I
> thought that it might be handy for Boost.
> The hash table offers O(1) *non-amortized* lookup via extremely well-balanced
> buckets (for example, at most 6 entries per bucket).
>
> The implementation itself is absurdly simple ;-)
> The hash table is basically a variation of open addressing with double
> hashing, plus a special re-insert strategy.
>
> Algorithm for open addressing (easy to adjust for chained hashing):
> Consider a two-dimensional N*K array A – N buckets with up to K elements per
> bucket.
> Assume that the zero key marks an empty entry.
>
> Lookup(key):
> for (hash = key, i = 0; i < K; ++i) {
>   hash = HASH(hash); // double-hashing
>   if (A[hash % N][i] == key) return found;
> }
>
> return not-found;
>
> Insert(key):
> for (hash = key, i = 0; i < K; ++i) {
>   hash = HASH(hash); // double-hashing
>   if (A[hash % N][i] == 0) { A[hash % N][i] = key; return true; }
> }
>
> return false;
>
> The above insert is basically bounded double hashing, and as-is it has
> low utilization (a bit similar to a Bloom filter).
> For example, with 100K buckets and K=6, insert will start failing when the
> table is at less than 50% utilization (i.e. it fails to insert even though
> most entries are empty).
>
> The trick is to perform a “re-insert”:
> Pick some random element, key1, from the failed insert's probe path, replace
> it with the new key, and (re-)Insert(key1).
> Retrying up to, say, 12 rounds of re-insert improves the utilization, for
> N=100K/K=6, to ~97%!
> A table resize is performed (if needed) when such repeated re-insertion fails.
>
> Altogether, this implementation trades off insert cost for fast lookup,
> which is very appealing for caches (i.e. the insert happens on the expensive
> path, only after a cache miss).
>
> Just for comparison, a hash table using naïve linear probing
> (A[HASH(key) % N][i]) has only ~80% utilization even with K=128 (i.e.
> deep buckets)!
> Such unbalanced behavior is also typical of classical hash tables with
> linked lists (chained hashing), since statistically some buckets will outgrow
> the average.
> The performance of the classic double-hashing algorithm itself is quite
> poor when the table is highly utilized, since many steps are required to find
> empty entries.
>
> Though this hash table is general purpose, it has some interesting
> properties:
> 1) O(1) non-amortized lookup with a very low chance of collisions (each key
> has its own unique “virtual” bucket).
> 2) Can be implemented in contiguous memory, without pointers and with
> excellent utilization.
> 3) Allows a wait-free implementation for caches (i.e. no lock or even an
> interlocked operation needed).
> 4) Allows caches in shared memory – a corruption-safe/hazard-tolerant
> algorithm (assuming at least 64-bit keys).
> 5) Allows persistence using direct memory-to-file mapping (i.e. the on-disk
> representation is the same as the in-memory one).
> 6) Average 50% LRU eviction for caches (i.e. on average, an entry from the
> older half is evicted).
>
> BTW – there is little chance of patenting this algorithm, since I have
> already published a similar idea in the public domain ;-)
>
> Thanks,
> Rani
>
> _______________________________________________
> Unsubscribe & other changes:
> http://lists.boost.org/mailman/listinfo.cgi/boost
>


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk