Boost logo

Boost Users :

From: Brian Allison (brian_at_[hidden])
Date: 2006-08-29 10:56:09


david v wrote:

>Hello,
>As James pointed out the right place to start with will be suffix tree
>algorithms. The only thing i don't know if this is going to be very time
>consuming ???
>
>
>######For Brian....
> 1) What's the longest string you're searching for?
>For the moment is limited to 6 to 8 characters.
>
> 2) What's the smallest bit-width of any cpu on which you're running
>your code?
>I'm on 64 dual opteron machine
>
> 3) How familiar with string searching/matching are you?
>Quite familiar but was looking for a cpp library that could handle it
>instead of trying to rewrite things.
>
>Best,
>david
>

  David,

  In my experience, the bit-parallel algorithm of Navarro/Hyyro is
significantly faster to build and smaller in memory than the suffix
trees. As a third bonus, it's a faster search technique. Two papers:
  1) "A Bit-Vector Algorithm for Computing Levenshtein and Damerau Edit
Distances" by Heikki Hyyro of the Dept of Computer and Information
Sciences, University of Tampere Finland.
  2) "Faster Bit-parallel Approximate String Matching" by Hyyro and
Gonzalo Navarro. I think that paper is from the Proc. 13th Combinatorial
Pattern Matching

  Both should show up on a google search if you use the exact title.

  Since you've a 64-bit machine, you could handle up to 63 characters of
a search pattern at a max, with any length text to search.

  Apologies that I can't share my implementations with you on this,
they're not for searching but for string matching. I seem to remember
that the papers are oriented towards your type of searching across a
large text stream.

regards, and congrats in advance,
Brian


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net