Boost logo

Boost :

From: John Maddock (John_Maddock_at_[hidden])
Date: 2000-01-11 07:56:26


Message text written by INTERNET:boost_at_[hidden]
>>
> 1. I don't buy the division of concepts you've chosen. A
> pattern should not contain state information:
> a. The same pattern should be usable to search two
> different texts in an interleaved fashion.
> b. A user who wrote a pattern to a file would not expect to
> be saving state information.

I think the word "state" is wrong here. BM/KMP information is more like
precomputed search tables. They depend only on the pattern, not on the
string(s) being searched. Um. As I write this, I wonder whether I've got
this backwards. If I have, I support you - a pattern object should only
contain information calculated from the search string, never from the
string
being searched. I expect

const BM_patt p("hello, world");
// point 1
pattern_search(str, p);
// point 2

to work. In other words, the observable state of p should not change
between
points 1 and 2 (const enforces this). But then again, pattern_search takes
a
const pattern_type& argument, so isn't that what I'm saying...?
<

There appears to be some confusion in terminology here - I've used "state"
to refer to the information pre-computed from the pattern - typically a
finite-state-machine (and yes the tables that Boyer-Moore and
Knuth-Morris-Pratte construct are state machines, albeit rather strange
ones), these are dependent only on the pattern and can be safely shared
between threads or whatever.

The interface for the pattern-type is based as closely as possible on
std::basic_string - which corresponds I think to perl's idea of an
"annotated" string that someone mentioned - there are differences however:
a pattern-type can not have insertions/appends/replace operations -
allowing arbitrary modifications to the pattern would make parsing patterns
that have a grammar (regular expressions/wild cards etc) either very hard
or very inefficient. In addition a pattern may be constructed from
non-text data - for example we can construct a boost::boyer_moore<long> and
expect it to seemlessly work as it does for character types - this need not
be a requirement for all pattern types however - regular expressions for
example are generally only used with "real" characters not non-text data.

Perhaps I choose the wrong examples in bm/kmp algorithms - in these cases
the "machine-readable" form of the pattern can be easily and rapidly
constructed, that isn't necessarily always the case - some complex regular
expressions may be quite expensive to parse and construct the state machine
for example.

Paul's usage above is correct - indeed one of the key requirements is that
constant patterns may be constructed once and then saved for future use -
or even pre-computed by a code generator.

The aim also is that generic code be possible: consider a fictitious text
editor which provides four kinds of search - a regular string search, a
string search with errors (aka fuzzy text searching), a phonetic text
search and a regular expression search, we may have pseudo-code something
like:

typedef boost::boyer_moore<char> lit_search;
typedef my_fuzzy_class<char> fuzzy_search;
typedef my_soundex<char> soundex_search;
typedef boost::reg_expression<char> re_search;

class buffer{/* our text editors buffer*/};

template <typename P>
void search_buffer(P& p, const buffer* b, std::string what)
{
  // assign what to pattern and do search:
  p.assign(what);
  // search each line:
  for(int i = 0; i < b->line_count(); ++i)
  {
    std::pair<std::string::iterator, std::string::iterator> result =
pattern_search(buffer->get_line(i), p);
    if(result.first != result.end)
    {
        buffer->set_selection(i, result);
        return;
     }
  }
}

//
// patterns declared as static members of fictional buffer class:

lit_search buffer::ls;
fuzzy_search buffer::fs;
soundex_search buffer::ss;
re_search buffer:rs;

void buffer::search(std::string what, int pat_type)
{
   switch(pat_type)
   {
   case 0:
     search_buffer(ls, this, what);
   case 1:
      search_buffer(fs, this, what);
   case 2:
      search_buffer(ss, this, what);
   case 3:
      search_buffer(rs, this, what);
   }
}

OK, now imagine that a hot new phonetic search algorithm become available -
no problem just change the typedef of soundex_search to the new search type
and voila! the code instantly changes to use the new code.

With regard to case-less matching and equivalence classes: In the sample
implementations I've abstracted all locale dependent code into
pattern_traits<charT>, these determine how these equivalence classes are
defined, in the default implementation these use std::locale, but
alternative implementation are possible - for example on Win32 we could
replace std::locale with LCID and use that with the win32 API to handle
case issues directly. Alternatively we could hard code a locale into the
traits class if we have a particular special case (Devanagari script lets
say), in which case we would have to change the typedef for that
pattern-type (to use the new traits class) and rebuild the executable for
that locale. As a concrete example one of the regex++ users has added
Japanese character classifications - [[:Kanji:]] etc - by modifying the
traits class.

Finally DFA's vs NFA's, there seem to occasions where either one or the
other is more appropriate, and there are more state-machine implementations
out there than maybe any other algorithm (with new theoretical work
appearing at fairly regular intervals as well). The proposed regular
expression interface is based upon a separation between the locale (via a
traits class), the state-machine, and the front end regular expression
grammar. The aim is that the proposed state-machine interface should be
implementable in a wide variety of ways - whether as DFA, NFA, or even
strange beasts like the bit-packed NFA that agrep/glimpse uses. It should
also be reusable by other front-ends: wild card expressions for example.

The syntax for custom state-machines would look something like:

typedef boost::reg_expression<char, regex_traits<char>,
std::allocator<char>, my_state_machine<char, regex_traits<char>,
std::allocator<char> > > my_regex;

which is not exactly concise, but I hope will be required relatively
infrequently. The syntax could be simplified a lot by using template
template parameters, but those aren't really supported by current
compilers.

This is long enough for now ! :-)

Thanks to everyone who's had a look at this, I realise that the proposal
became somewhat long!

- John.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk