Boost logo

Boost :

From: Dave Abrahams (abrahams_at_[hidden])
Date: 2000-01-09 12:42:37


>>The distinction of case-sensitivity, while useful, opens up a whole can of
>> worms. For example, what about diacritical marks? Often, they should be
>> ignored. I don't know what other language/alphabet-dependent features might
>> exist out there...<

> I agree, however I think that it's a neglect to simply ignore the issue -
> even if the "best solution" - whatever that is - is imperfect.

I absolutely agree with this approach.

> It is up to
> the implementer of a particular pattern type to decide what this means and
> document it - in the case of the sample classes all locale dependent code
> is abstracted out to a traits class - users that require differing
> semantics from the default can obtain this by providing their own traits
> class that handles the distinctions. To put this another way, consider
> each character in a "pattern" as a member of two equivalence classes - one
> case sensitive, one case insensitive - each member of an equivalence class
> is treated as the same as any other member of the same class. By
> incorporating case into the spec design, there is a way to choose which of
> these two classes we want to deal with - of course there may be other
> equivalence classes (the example I think I gave was half and full width
> katakana characters), but the two given are the most common. In other
> words we can not please all people (or all locales) all of the time, but we
> can try to please most of them most of the time, while using designs that
> permit alternative semantics without having to rewrite the whole class.

I'm not sure I understand the details, but it sounds like you're saying
there's room to non-invasively extend the design to cover new equivalence
classes. If so, I'm happy.

> BTW as a matter of practice, the original version of regex++ did not have
> any case distinctions built in, instead a custom traits class was required
> to handle this - it quickly became apparent that this was not the way to go
> - on grounds of both utility and particularly code bloat.

Experience speaks loudly.

>> Funny thing: I always thought of knuth-morris-pratt and boyer-moore as search
>> algorithms, not types of patterns, as you have defined them. Is pattern type
>> really so intimately tied to search algorithm? For example, a regular
>> expression might be searched using NFAs or DFAs. These really seem like
>> different concepts to me.
>
> The Knuth Morris Pratt algorithm is a DFA that's computed "on the fly".
> The pattern-type is mean to encapsulate any state information required by
> the algorithm as well as the patterns grammar (if any). There are two aims
> here -
>
> 1) The pattern matching algorithms may be called repeatedly with the same
> pattern - by caching the state information in a "pattern-type" we can avoid
> the overhead of repeatedly compiling the pattern.
> 2) The pattern-types are to some degree interchangeable - we can easily
> replace knuth_morris_pratt by boyer_moore and recompile, choosing whatever
> works best for our particular situation, the same argument would apply to
> NFA vs DFA based regular expressions. We could even replace a literal
> search (boyer_moore) with a "string search with errors" algorithm by
> changing the pattern's type. In general it's much easier to change a type,
> once, in a typedef, than to define different algorithms for every kind of
> search, and then have to go though code and change these in every place
> that they occur. I'm not sure whether this answers your question, or
> whether I've misunderstood.

1. I don't buy the division of concepts you've chosen. A pattern should not
contain state information:
  a. The same pattern should be usable to search two different texts in an
interleaved fashion.
  b. A user who wrote a pattern to a file would not expect to be saving
state information.

2. It seems to me that search algorithms could well be classes instead of
functions. Then you could replace a typedef representing the search
algorithm instead of the pattern.

-Dave


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk