Boost logo

Boost :

From: Andrey Semashev (andysem_at_[hidden])
Date: 2007-06-11 09:04:36


Hello JOAQUIN,

Monday, June 11, 2007, 1:46:53 AM, you wrote:

>> My technique was based on a number of hash tables and a little
>> knowledge of the wildcards nature. :) Do you consider this kind of
>> index could be generalized more to be included in B.MI (some kind of
>> "like_index" or "matching_index")?

> Could be. The scenario looks very attractive, and rather intriguing
> too. Could I learn more about your implementation techniques?
> This does not seem an algoritmhically trivial problem at first sight.

Like I said, I used the fact that only the end of the string may be
masked. This allowed me to calculate hash over the fixed chars in the
beginning of the wildcards and guarantee that these hash results will
be equal to the corresponding partial hash results of the complete
strings, calculated for the same number of the first chars of the
string. For example:

abc*
abc??*
abc???

will have the same hash, and it will be equal to hash of "abcde",
calculated for the first 3 chars.

Since I have all wildcards passing through insert methods, I can
determine the maximum number of the leading fixed chars in the
wildcards. This allows me an ability to calculate only needed partial
hash results of the full strings, as follows, for "abcde":

h0 - for empty string ""
h1 - for "a"
h2 - for "ab"
h3 - for "abc"

Other hash results (for "abcd" and "abcde") are not needed and this
not calculated.

Obviously, these three wildcards above would reside in the same bucket
in this approach, so I had to order them within the bucket by their
priority:

abc???
abc??*
abc*

That guarantees that lookup will yield a more detailed wildcard.

The downside is that the bucket (if the number of buckets is not very
big) may contain many elements which will decrease lookup speed.
That's why I went further and separated hash tables (i.e. lists of
buckets) by the number of the leading fixed chars. This made these
wildcards:

abc???
de*
f????

reside in different hash tables even if their hash results are equal.
The lookup algorithm remains the same except that it repeats for each
number of leading fixed chars decreasingly.

The worst lookup case would be N hash table lookups, where N is the
maximum number of leading fixed chars (which will most likely be 3 to
5 in my case). Considering that the bucket will likely contain no more
than one element, there will commonly be no more than 5 matching
attempts per a lookup. Not bad if there are 100 000 elements in the
container.

However, this may look like a questionable optimization in general,
and I have yet to test whether it does any good. But at least it
simplified implementation as I didn't have to complicate things with
dynamic rehashing heuristic.

As you may see now, my solution is tightly coupled with the specific
problem I faced. B.MI index approach should be way more general, if
such an index is to be developed.

-- 
Best regards,
 Andrey                            mailto:andysem_at_[hidden]

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk