|
Boost : |
From: John Maddock (John_Maddock_at_[hidden])
Date: 2000-01-11 07:56:26
Message text written by INTERNET:boost_at_[hidden]
>>
> 1. I don't buy the division of concepts you've chosen. A
> pattern should not contain state information:
> a. The same pattern should be usable to search two
> different texts in an interleaved fashion.
> b. A user who wrote a pattern to a file would not expect to
> be saving state information.
I think the word "state" is wrong here. BM/KMP information is more like
precomputed search tables. They depend only on the pattern, not on the
string(s) being searched. Um. As I write this, I wonder whether I've got
this backwards. If I have, I support you - a pattern object should only
contain information calculated from the search string, never from the
string
being searched. I expect
const BM_patt p("hello, world");
// point 1
pattern_search(str, p);
// point 2
to work. In other words, the observable state of p should not change
between
points 1 and 2 (const enforces this). But then again, pattern_search takes
a
const pattern_type& argument, so isn't that what I'm saying...?
<
There appears to be some confusion in terminology here - I've used "state"
to refer to the information pre-computed from the pattern - typically a
finite-state-machine (and yes the tables that Boyer-Moore and
Knuth-Morris-Pratte construct are state machines, albeit rather strange
ones), these are dependent only on the pattern and can be safely shared
between threads or whatever.
The interface for the pattern-type is based as closely as possible on
std::basic_string - which corresponds I think to perl's idea of an
"annotated" string that someone mentioned - there are differences however:
a pattern-type can not have insertions/appends/replace operations -
allowing arbitrary modifications to the pattern would make parsing patterns
that have a grammar (regular expressions/wild cards etc) either very hard
or very inefficient. In addition a pattern may be constructed from
non-text data - for example we can construct a boost::boyer_moore<long> and
expect it to seemlessly work as it does for character types - this need not
be a requirement for all pattern types however - regular expressions for
example are generally only used with "real" characters not non-text data.
Perhaps I choose the wrong examples in bm/kmp algorithms - in these cases
the "machine-readable" form of the pattern can be easily and rapidly
constructed, that isn't necessarily always the case - some complex regular
expressions may be quite expensive to parse and construct the state machine
for example.
Paul's usage above is correct - indeed one of the key requirements is that
constant patterns may be constructed once and then saved for future use -
or even pre-computed by a code generator.
The aim also is that generic code be possible: consider a fictitious text
editor which provides four kinds of search - a regular string search, a
string search with errors (aka fuzzy text searching), a phonetic text
search and a regular expression search, we may have pseudo-code something
like:
typedef boost::boyer_moore<char> lit_search;
typedef my_fuzzy_class<char> fuzzy_search;
typedef my_soundex<char> soundex_search;
typedef boost::reg_expression<char> re_search;
class buffer{/* our text editors buffer*/};
template <typename P>
void search_buffer(P& p, const buffer* b, std::string what)
{
// assign what to pattern and do search:
p.assign(what);
// search each line:
for(int i = 0; i < b->line_count(); ++i)
{
std::pair<std::string::iterator, std::string::iterator> result =
pattern_search(buffer->get_line(i), p);
if(result.first != result.end)
{
buffer->set_selection(i, result);
return;
}
}
}
//
// patterns declared as static members of fictional buffer class:
lit_search buffer::ls;
fuzzy_search buffer::fs;
soundex_search buffer::ss;
re_search buffer:rs;
void buffer::search(std::string what, int pat_type)
{
switch(pat_type)
{
case 0:
search_buffer(ls, this, what);
case 1:
search_buffer(fs, this, what);
case 2:
search_buffer(ss, this, what);
case 3:
search_buffer(rs, this, what);
}
}
OK, now imagine that a hot new phonetic search algorithm become available -
no problem just change the typedef of soundex_search to the new search type
and voila! the code instantly changes to use the new code.
With regard to case-less matching and equivalence classes: In the sample
implementations I've abstracted all locale dependent code into
pattern_traits<charT>, these determine how these equivalence classes are
defined, in the default implementation these use std::locale, but
alternative implementation are possible - for example on Win32 we could
replace std::locale with LCID and use that with the win32 API to handle
case issues directly. Alternatively we could hard code a locale into the
traits class if we have a particular special case (Devanagari script lets
say), in which case we would have to change the typedef for that
pattern-type (to use the new traits class) and rebuild the executable for
that locale. As a concrete example one of the regex++ users has added
Japanese character classifications - [[:Kanji:]] etc - by modifying the
traits class.
Finally DFA's vs NFA's, there seem to occasions where either one or the
other is more appropriate, and there are more state-machine implementations
out there than maybe any other algorithm (with new theoretical work
appearing at fairly regular intervals as well). The proposed regular
expression interface is based upon a separation between the locale (via a
traits class), the state-machine, and the front end regular expression
grammar. The aim is that the proposed state-machine interface should be
implementable in a wide variety of ways - whether as DFA, NFA, or even
strange beasts like the bit-packed NFA that agrep/glimpse uses. It should
also be reusable by other front-ends: wild card expressions for example.
The syntax for custom state-machines would look something like:
typedef boost::reg_expression<char, regex_traits<char>,
std::allocator<char>, my_state_machine<char, regex_traits<char>,
std::allocator<char> > > my_regex;
which is not exactly concise, but I hope will be required relatively
infrequently. The syntax could be simplified a lot by using template
template parameters, but those aren't really supported by current
compilers.
This is long enough for now ! :-)
Thanks to everyone who's had a look at this, I realise that the proposal
became somewhat long!
- John.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk