
From: John Maddock (John_Maddock_at_[hidden])
Date: 2000-01-09 07:58:26


Dave,

Thanks for taking the time to respond:

>Just perusing, at first glance:

    template<class charT>
    bool pattern_match(const charT* ptr, const pattern_type& pat);

    Returns the result of pattern_match(ptr, ptr +
        pattern_type::traits_type::length(ptr), pat);

That final expression looks like a misuse of traits. What if pattern_type
were const char*? Don't you mean:

    traits<pattern_type>::length(ptr)<

Um, yes, that's an error - a particular implementation could be written that
way, but in general, how the algorithm determines the length of the string
should be left undefined for anything other than char and wchar_t. Well
spotted.
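To make the two-overload design concrete, here is a minimal sketch. The `literal_pattern` type and the body of the match function are hypothetical illustrations, not taken from the library; the point is only that the convenience overload derives the length via `std::char_traits<char>`, which is well defined for char (and, analogously, wchar_t):

```cpp
#include <cassert>
#include <string>

// Hypothetical literal pattern type, for illustration only.
struct literal_pattern {
    std::string text;
};

// Core overload: searches an explicit [first, last) range for the pattern.
bool pattern_match(const char* first, const char* last,
                   const literal_pattern& pat) {
    return std::string(first, last).find(pat.text) != std::string::npos;
}

// Convenience overload: the length of a null-terminated string is
// obtained via std::char_traits<char>; this is only meaningful for
// char (and wchar_t via std::char_traits<wchar_t>).
bool pattern_match(const char* ptr, const literal_pattern& pat) {
    return pattern_match(ptr, ptr + std::char_traits<char>::length(ptr), pat);
}
```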

>The distinction of case-sensitivity, while useful, opens up a whole can of
worms. For example, what about diacritical marks? Often, they should be
ignored. I don't know what other language/alphabet-dependent features might
exist out there...<

I agree; however, I think it would be negligent to simply ignore the issue -
even if the "best solution" - whatever that is - is imperfect. It is up to
the implementer of a particular pattern type to decide what case
sensitivity means and document it - in the case of the sample classes, all
locale-dependent code is abstracted out to a traits class - users who
require different semantics from the default can obtain them by providing
their own traits class that handles the distinctions. To put this another
way, consider each character in a "pattern" as a member of two equivalence
classes - one case-sensitive, one case-insensitive - where each member of
an equivalence class is treated the same as any other member of the same
class. By incorporating case into the design, there is a way to choose
which of these two classes we want to deal with - of course there may be
other equivalence classes (the example I think I gave was half- and
full-width katakana characters), but the two given are the most common. In
other words, we cannot please all people (or all locales) all of the time,
but we can try to please most of them most of the time, while using designs
that permit alternative semantics without having to rewrite the whole
class. BTW, as a matter of practice, the original version of regex++ did
not have any case distinctions built in; instead, a custom traits class was
required to handle this. It quickly became apparent that this was not the
way to go - on grounds of both utility and particularly code bloat.
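A toy sketch of the equivalence-class idea, with all names invented for illustration: a traits class supplies the character-equality test, so swapping in a case-folding traits class changes which equivalence classes the comparison uses, without touching the matching code:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>

// Default traits: each character is equivalent only to itself.
struct default_traits {
    static bool eq(char a, char b) { return a == b; }
};

// User-supplied traits: folds case, so 'A' and 'a' land in the
// same equivalence class.
struct icase_traits {
    static bool eq(char a, char b) {
        return std::tolower(static_cast<unsigned char>(a)) ==
               std::tolower(static_cast<unsigned char>(b));
    }
};

// A comparison parameterised on a traits class, in the spirit of the
// design described above (hypothetical, not the regex++ interface).
template <class Traits>
bool equal_under(const char* a, const char* b) {
    std::size_t la = std::strlen(a), lb = std::strlen(b);
    if (la != lb) return false;
    for (std::size_t i = 0; i < la; ++i)
        if (!Traits::eq(a[i], b[i])) return false;
    return true;
}
```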

>
Funny thing: I always thought of knuth-morris-pratt and boyer-moore as
search algorithms, not types of patterns, as you have defined them. Is
pattern type really so intimately tied to search algorithm? For example, a
regular expression might be searched using NFAs or DFAs. These really seem
like different concepts to me.<

The Knuth-Morris-Pratt algorithm is a DFA that's computed "on the fly".
The pattern-type is meant to encapsulate any state information required by
the algorithm, as well as the pattern's grammar (if any). There are two
aims here:

1) The pattern matching algorithms may be called repeatedly with the same
pattern - by caching the state information in a "pattern-type" we can avoid
the overhead of repeatedly compiling the pattern.
2) The pattern-types are to some degree interchangeable - we can easily
replace knuth_morris_pratt with boyer_moore and recompile, choosing
whatever works best for our particular situation; the same argument would
apply to NFA- vs DFA-based regular expressions. We could even replace a
literal search (boyer_moore) with a "string search with errors" algorithm
just by changing the pattern's type. In general, it's much easier to change
a type once, in a typedef, than to define different algorithms for every
kind of search and then have to go through the code and change these in
every place that they occur. I'm not sure whether this answers your
question, or whether I've misunderstood.
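Both points above can be illustrated with the searcher objects that later entered the standard library (this message predates them, but C++17's `std::boyer_moore_searcher` embodies the same idea of a precompiled, reusable pattern object whose type selects the algorithm):

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>

// The pattern object is constructed once (the "compile" step) and then
// reused by std::search; changing the searcher's type here - e.g. to
// std::boyer_moore_horspool_searcher - swaps the algorithm without
// touching any call site.
bool contains(const std::string& text, const std::string& pat) {
    std::boyer_moore_searcher<std::string::const_iterator>
        searcher(pat.begin(), pat.end());
    return std::search(text.begin(), text.end(), searcher) != text.end();
}
```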

- John.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk