From: Vladimir Pozdyayev (pvv_at_[hidden])
Date: 2004-12-22 03:18:29


John Maddock wrote:

> This is an area I want to explore though, if I can get this next lot of code
> out the door, then I'll create a cvs branch to experiment with this, if you
> want to suggest / experiment with a design for the abstract creator in the
> meantime, then go for it.

I finally got around to refactoring the ANFA code. Take a look at
http://groups.yahoo.com/group/boost/files/anfa-regex/anfa091.zip
(once again, this is not a full-scale regex library... yet)

The contents of the DESIGN file are appended to this message.

-- 
Best regards,
Vladimir Pozdyayev.
----------------------------------------------------------------------
-= Regex Design Issues =-
The core classes.
* charset
    Provides "bool operator()( character )". There is not much to say
    beyond that, but do see the Charset Issues section.
* charset creator
    Supports arbitrary charset expressions (within the limits of a
    given set of possible operations). Implementations, however, are
    not required to provide _all_ the declared functionality; calls
    for unsupported features should result in appropriate exceptions.
    Also provides the "void create( charset & )" function which is
    used to initialize a charset newly created by "matcher creator".
    The "abstract_charset_creator" class provides stubs for all
    expression elements possibly to be requested by regex parsers.
    Implementations with limited functionality can inherit them and
    redefine only those functions that should actually do something
    useful.
* matcher
    Provides the low-level matching functionality, for example,
    finding the first occurrence of a pattern. Replacing all
    occurrences, by contrast, is a high-level action, because it
    consists of (1) finding them and (2) creating a modified string,
    so it should go into the "regex" class. (Then again, if the
    replacement can be done on the fly while searching, it becomes a
    low-level action. I don't know whether that can be done in a
    sufficiently general way, however.)
* matcher creator
    Like charset creator, only for matchers.
* parser
    The syntax parser. Takes the input string in the form of
    begin-end iterators, and issues a sequence of charset/matcher
    creator calls ending with "matcher_creator::create" (or an
    exception). In essence, simply provides the function
    "void parse( matcher &m, iterator begin, iterator end )".
    A parser must be consistent with the properties of "creator"
    classes.
* regex
    A wrapper for the "matcher" class. Provides the high-level
    creation & utilization routines.
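A sketch of the creator stubs.
    The following is a minimal illustration of the "charset creator"
    item above, not the code from the archive. Only "basic_charset",
    "abstract_charset_creator" and "create" are names taken from this
    file; "unsupported_feature", "simple_charset_creator",
    "push_char", "push_range", "op_union" and "op_complement" are
    hypothetical names chosen for the example, and the stack-based
    interface is an assumption.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical exception for features an implementation does not provide.
struct unsupported_feature : std::logic_error {
    using std::logic_error::logic_error;
};

// A charset only answers membership queries: bool operator()( character ).
template <class Char>
class basic_charset {
public:
    bool operator()(Char c) const { return members_.count(c) != 0; }
    void assign(std::set<Char> members) { members_ = std::move(members); }
private:
    std::set<Char> members_;
};

// Every expression element a parser might request gets a throwing stub.
template <class Char>
class abstract_charset_creator {
public:
    virtual ~abstract_charset_creator() = default;
    virtual void push_char(Char) { throw unsupported_feature("push_char"); }
    virtual void push_range(Char, Char) { throw unsupported_feature("push_range"); }
    virtual void op_union() { throw unsupported_feature("op_union"); }
    virtual void op_complement() { throw unsupported_feature("op_complement"); }
    virtual void create(basic_charset<Char>&) { throw unsupported_feature("create"); }
};

// A limited implementation: it redefines single characters, union and
// create, and inherits the throwing stubs for everything else.
template <class Char>
class simple_charset_creator : public abstract_charset_creator<Char> {
public:
    void push_char(Char c) override { stack_.push_back(std::set<Char>{c}); }

    void op_union() override {
        if (stack_.size() < 2) throw std::logic_error("op_union needs two operands");
        std::set<Char> top = std::move(stack_.back());
        stack_.pop_back();
        stack_.back().insert(top.begin(), top.end());
    }

    // Fills the target charset and pops the finished expression.
    void create(basic_charset<Char>& target) override {
        if (stack_.size() != 1) throw std::logic_error("expected one expression");
        target.assign(std::move(stack_.back()));
        stack_.pop_back();
    }

private:
    std::vector<std::set<Char>> stack_;
};
```

    A parser written against "abstract_charset_creator" can call any
    element; with this limited creator, a range request surfaces as
    an "unsupported_feature" exception instead of a silent no-op.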
How they are connected.
    A sample from "regex.cpp":
    typedef basic_regex<
        basic_simple_parser<
            basic_charset_creator< wchar_t >,
            basic_anfa_matcher_creator< basic_charset< wchar_t > >,
            basic_anfa_matcher< basic_charset< wchar_t > >
        >,
        basic_anfa_matcher< basic_charset< wchar_t > >
    > regex;
On "creators" and "create" functions.
    The name is somewhat misleading, since they fill target objects
    with compiled data rather than create them. Still, "charset
    compiler" sounds a bit weird... or does it?
    Anyway. The "creators" are subject to the following uses and
    requirements. They must be able to destruct themselves gracefully
    even if the expression they are being fed is only halfway done
    (in case someone has thrown an "unsupported" exception). The
    "create" function must do an implicit "pop" from the expression
    stack, so that the "creator" object can be reused. The "create"
    function can assume there is only one top-level expression tree
    node on the stack.
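    One way to meet these requirements, sketched under assumptions:
    if the expression stack owns its nodes through smart pointers, an
    abandoned half-built expression is released by the destructor
    with no extra code, and "create" can pop the finished tree so the
    object is ready for the next expression. The names "node",
    "compiled", "sketch_creator", "push_literal", "concat" and
    "pending" are hypothetical, and "compiled" is only a stand-in for
    a real matcher.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Expression tree node; children are owned, so cleanup is automatic.
struct node {
    std::string label;
    std::unique_ptr<node> left, right;
};

// Stand-in for the object that "create" fills with compiled data.
struct compiled {
    std::string text;
};

class sketch_creator {
public:
    void push_literal(char c) {
        auto n = std::make_unique<node>();
        n->label = std::string(1, c);
        stack_.push_back(std::move(n));
    }

    // Combines the two topmost expressions into one concatenation node.
    void concat() {
        if (stack_.size() < 2) throw std::logic_error("concat needs two operands");
        auto n = std::make_unique<node>();
        n->label = ".";
        n->right = std::move(stack_.back());
        stack_.pop_back();
        n->left = std::move(stack_.back());
        stack_.pop_back();
        stack_.push_back(std::move(n));
    }

    // Fills the target and implicitly pops the finished expression.
    void create(compiled& target) {
        if (stack_.size() != 1) throw std::logic_error("expected one expression");
        std::unique_ptr<node> root = std::move(stack_.back());
        stack_.pop_back();
        target.text = render(*root);
    }

    std::size_t pending() const { return stack_.size(); }

private:
    static std::string render(const node& n) {
        if (!n.left) return n.label;
        return "(" + render(*n.left) + n.label + render(*n.right) + ")";
    }

    std::vector<std::unique_ptr<node>> stack_;
};
```

    After "create" returns, "pending()" is zero and the same object
    can take a new expression. A creator that goes out of scope with
    nodes still on its stack frees them through "unique_ptr".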
-= Charset Issues =-
    (Should I rename them to "character sets" for consistency with
    the full-names style?)
    All the above templates allow a lot of freedom in mixing
    different character types, let alone different character
    encodings. E.g., the sample program has all regex templates
    instantiated with wide characters, but the regex string itself is
    char-typed. This clearly needs to be controlled.
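    One possible form of control, offered only as a sketch: have the
    parse entry point compare the character type of its input
    iterators with the character type of the creator and reject a
    mismatch at compile time. The names "typed_creator", "char_type",
    "chars_match" and "checked_parse" are hypothetical, the code
    assumes C++14 or later, and it checks character types only, not
    encodings.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <type_traits>

// Hypothetical creator that publishes the character type it works with.
template <class Char>
struct typed_creator {
    using char_type = Char;
    std::basic_string<Char> seen;  // records what the parser fed in
    void push_char(Char c) { seen.push_back(c); }
};

// True when the iterator yields exactly the creator's character type.
template <class Creator, class Iterator>
constexpr bool chars_match =
    std::is_same<typename std::iterator_traits<Iterator>::value_type,
                 typename Creator::char_type>::value;

// Refuses to compile when the input and the creator disagree on the type.
template <class Creator, class Iterator>
void checked_parse(Creator& creator, Iterator begin, Iterator end) {
    static_assert(chars_match<Creator, Iterator>,
                  "regex string and creator use different character types");
    for (; begin != end; ++begin) creator.push_char(*begin);
}
```

    With this check, feeding a "std::string" to a creator that was
    instantiated for "wchar_t" fails to compile, so any conversion
    between the two has to be written explicitly by the caller.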
