|
Boost : |
Subject: [boost] Aho - Corasick String Search (for searching multiple patterns) Interface
From: meghana madhyastha (meghana.madhyastha_at_[hidden])
Date: 2015-05-26 01:21:01
Hi,
In some of my previous mails I had suggested that the Rabin Karp and Aho -
Corasick string search algorithms be implemented as a part of the Boost
algorithm library. This is primarily because these algorithms perform well
for searching for multiple patterns.
The Aho - Corasick algorithm works by building a trie (a tree with each
node corresponding to an ASCII character) of the patterns strings and
traversing the trie to search for the pattern in a given text.
Additionally, the Aho - Corasick introduced the concept of "failure
pointer/failure node" which is the node to be traversed when there is a
mismatch.
For more information:
http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
The interface I propose is as follows.
1. The input of the patterns should be strings within a *container*. This
is because these algorithms are primarily used for multiple pattern
searching. The user can input pointers/iterators to the first and last
pattern . The final output which is returned is a container with each
pattern and a iterator/pointer to their occurrences (There are multiple
patterns so just returning an iterator/pointer to where the patterns first
occurs in the pattern will not suffice).
2. Just like the previous string searching algorithms that have been
implemented in Boost, there can be an object based interface as well as a
procedure based interface. In the object based interface, the trie is
constructed and the failure nodes are computed in the constructor (while
creating the object). The user can then search the text using the
operator().
template <typename patIter>
class aho_corasick {
public:
aho_corasick(patIter first, patIter last): pat_first(first),
pat_last(last), head(std::shared_ptr<trie_node>(new trie_node)) {
//first and last are pointers/iterators to the first and last patterns in
the container
gotoFn(head);
failureFn(head);
}
~aho_corasick () {};
template <typename corpusIter>
corpusIter operator ()(corpusIter corpus_first, corpusIter corpus_last)
//Note: using the same notation as used in the other search algorithms.
Corpus- refers to the text to be searched.
I've implemented this and would like to get feedback/suggestions on the
code as well as the interface. Where should I put up the code for it to be
reviewed ?
Thanks,
Meghana
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk