Subject: [boost] template-defined regexp proposal
From: Vit Stepanek (vit.stepanek_at_[hidden])
Date: 2011-01-30 18:30:42


Hi,

I've implemented basic regexp functionality using a few simple
template classes. Any regexp can be built by nesting the template
classes inside one another in the required order.
Although it's still in the design stage, I'd like to find out whether
there's any interest in providing this functionality.

Let me give a short introduction to the topic.

Some basic classes include
- String Match (full string matching)
- Is-in Match (matches any character in the list)
- Or Match
- Quantity

and so on.

These are the "building blocks" of the regexp. Each class does very
little on its own, but when combined in a specific order they give you a
clearly defined string matching algorithm of any kind and complexity.

In addition, it's quite simple to add any missing functionality. You
just need to define your desired algorithm and mix it with the others.

Here are a few examples that should show how it's all done.

Regexp: ^[a-z]+
is implemented as:

Quantity<
        Range< char_a, char_z >,
        1
> re;

(sorry if you don't like the indentation, but that's just to ease the
reading)
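
For readers who want something concrete, here is a minimal sketch of how
two such bricks could look. It is not the proposed implementation: the
interface (an operator() taking a character range and returning the
position just past the match, or nullptr on failure), the char_a/char_z
constants and the matched_prefix helper are all assumptions made for
illustration.

```cpp
#include <cstddef>
#include <string>

// Assumed definitions; the post does not say what char_a and char_z are.
constexpr char char_a = 'a';
constexpr char char_z = 'z';

// Matches exactly one character in the inclusive range [Lo, Hi].
// Returns the position just past the match, or nullptr if there is none.
template <char Lo, char Hi>
struct Range {
    const char* operator()(const char* first, const char* last) const {
        return (first != last && *first >= Lo && *first <= Hi) ? first + 1
                                                               : nullptr;
    }
};

// Greedily repeats Inner and succeeds if it matched at least Min times.
template <typename Inner, std::size_t Min>
struct Quantity {
    const char* operator()(const char* first, const char* last) const {
        Inner inner;
        std::size_t count = 0;
        while (const char* next = inner(first, last)) {
            if (next == first) break;  // guard against empty matches
            first = next;
            ++count;
        }
        return count >= Min ? first : nullptr;
    }
};

// Hypothetical helper: length of the matched prefix of text, or -1.
template <typename Brick>
long matched_prefix(const std::string& text) {
    const char* first = text.data();
    const char* end = Brick{}(first, first + text.size());
    return end ? static_cast<long>(end - first) : -1;
}
```

With these definitions, matched_prefix< Quantity< Range< char_a, char_z >, 1 > >
applied to "abc123" consumes the three leading letters, and applied to
"123" it reports no match.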

Regexp: ^0(x|X)[a-fA-F0-9]+ (matching the hexadecimal number)

The required strings are hardcoded using the macro "STRHOLDER", which
generates a simple struct containing the given string (to avoid the need
to pass strings in at run time, which would split the matching logic
into two levels).

STRHOLDER( 0, "0" );
STRHOLDER( xX, "xX" );

MultiTie<
        StrMatch< StringHolder_0 >,
        IsIn< StringHolder_xX >,
        Quantity<
                OrMatch<
                        DigitChar,
                        OrMatch<
                                Range< char_a, char_f >,
                                Range< char_A, char_F >
                        >
                >,
                1
        >
> re;

The "MultiTie" ties the match parts together into a consecutive sequence.
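
As a rough illustration of the string holder and the sequencing, the
sketch below covers only the "0x" / "0X" prefix of the example. The
expansion of STRHOLDER, the value() accessor, the pointer-returning
operator() and the starts_with_hex_prefix helper are guesses, not the
actual code, and the variadic MultiTie assumes a C++11 compiler.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical expansion: a struct whose static member returns the literal.
#define STRHOLDER(name, str) \
    struct StringHolder_##name { static const char* value() { return str; } }

STRHOLDER(0, "0");
STRHOLDER(xX, "xX");

// Matches the whole held string at the current position.
template <typename Holder>
struct StrMatch {
    const char* operator()(const char* first, const char* last) const {
        const char* s = Holder::value();
        const std::size_t n = std::strlen(s);
        if (static_cast<std::size_t>(last - first) < n) return nullptr;
        return std::memcmp(first, s, n) == 0 ? first + n : nullptr;
    }
};

// Matches one character that appears anywhere in the held string.
template <typename Holder>
struct IsIn {
    const char* operator()(const char* first, const char* last) const {
        if (first == last) return nullptr;
        for (const char* s = Holder::value(); *s != '\0'; ++s)
            if (*s == *first) return first + 1;
        return nullptr;
    }
};

// Runs each part where the previous one stopped; fails if any part fails.
template <typename... Parts>
struct MultiTie;

template <>
struct MultiTie<> {
    const char* operator()(const char* first, const char*) const {
        return first;  // nothing left to match
    }
};

template <typename Head, typename... Tail>
struct MultiTie<Head, Tail...> {
    const char* operator()(const char* first, const char* last) const {
        const char* next = Head{}(first, last);
        return next ? MultiTie<Tail...>{}(next, last) : nullptr;
    }
};

using HexPrefix = MultiTie<StrMatch<StringHolder_0>, IsIn<StringHolder_xX>>;

// Hypothetical helper for trying the sketch out.
inline bool starts_with_hex_prefix(const std::string& text) {
    const char* first = text.data();
    return HexPrefix{}(first, first + text.size()) != nullptr;
}
```

Note that this sketch does no backtracking: each part gets one attempt at
the position where the previous part stopped.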

Some other examples I tested include
- email address regexp
- quoted string matching (containing escaped quoting chars)

Currently the implementation consists of 21 template classes, which are
enough to implement (I hope) almost any regular expression.
Each "brick" is invoked through its operator(). That allows it to be
passed as an argument to the standard algorithms (like std::replace_if etc.).
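
One way to read the remark about standard algorithms is that
single-character bricks can act as unary predicates. The sketch below
assumes exactly that; CharRange, CharOr and mask_hex_digits are
hypothetical names for illustration and are not taken from the
implementation.

```cpp
#include <algorithm>
#include <string>

// Predicate brick: true if c lies in the inclusive range [Lo, Hi].
template <char Lo, char Hi>
struct CharRange {
    bool operator()(char c) const { return c >= Lo && c <= Hi; }
};

// Predicate brick: true if either of the two inner bricks accepts c.
template <typename A, typename B>
struct CharOr {
    bool operator()(char c) const { return A{}(c) || B{}(c); }
};

using HexLetter = CharOr<CharRange<'a', 'f'>, CharRange<'A', 'F'>>;
using HexDigit = CharOr<CharRange<'0', '9'>, HexLetter>;

// Replaces every hexadecimal digit in text with '#'.
inline std::string mask_hex_digits(std::string text) {
    std::replace_if(text.begin(), text.end(), HexDigit{}, '#');
    return text;
}
```

Because the predicate is an empty object whose behaviour is fixed by its
type, the compiler is free to inline the whole comparison chain into the
algorithm's loop.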

I've already implemented the ability to pass parameters to the
regexp at run time (but that's not the major feature).

Some rationale in its defense:

String matching is a widely used part of coding. Often it's also
one of the problematic areas (especially when using self-implemented
matching functions without any general solution). Once it's done, it can
be hard to change, because all that can be seen is HOW the match is
done, but not WHAT it does. Even when using some higher-level
methods, it still isn't self-descriptive.
Template-defined regexp provides one solution to this issue by hiding
the HOW behind the WHAT, and by taking advantage of compile-time
checking. One more thing worth mentioning: performance is not affected by
any kind of interpretation, since everything is prepared at compile time
(but I don't want to touch anybody's baby here!).

Please don't look at:
- the proposed name "template-defined regexp"; I use it as a temporary
name
- the naming of the classes
- implementation details (yes, they aren't here, because they're not
necessary for now)
- the term "regexp" - I'm not sure if this can be called that...

Your opinions are welcome!

Vit Stepanek
------------

