Boost logo

Boost :

Subject: Re: [boost] [regex, xpressive] interesting(?) perf benchmark
From: Eric Niebler (eric_at_[hidden])
Date: 2010-06-08 09:13:04

On 6/8/2010 4:22 AM, John Bytheway wrote:
> On 08/06/10 01:34, Eric Niebler wrote:
>> I haven't forgotten. IIRC, lexertl is like dynamic xpressive: it accepts
>> strings as input and builds a FSA. I was speculating about what the
>> lexertl equivalent of static xpressive would look like and whether it
>> would be even faster. Then static xpressive could use static lexertl for
>> simple regexes, and dynamic xpressive could use the dynamic one.
> I once wrote a compile-time lexer compiler, which seems to be the sort
> of thing you're suggesting. It worked like this:
> typedef lexer<
> rule<literal<'k','e','y','w','o','r','d'>, construct<KEYWORD> >,

mpl::string could help you here.

> rule<one_or_more<range<'a','z'> >, construct<ID, mpl::true_> >,
> rule<alternative_literal<' ','\n','\t','\r'>, discard >
> > keyword_lexer;
> This defines a lexer recognizing whitespace-separated lower-case words,
> but treating the word 'keyword' as special. Each rule of the lexer is a
> regex followed by what to do when that regex is matched. The first rule
> constructs a KEYWORD token, the second constructs an ID token (and the
> mpl::true_ means it passes the range which matched the regex to the
> constructor), the third matches without creating a token.
> The library will take this lexer specification and turn it into an NFA,
> then transform that into a DFA, then encode that as a transition table
> in an array, *all at compile time*.

Wow! This is something like what I had in mind.

> The down side with this approach, of course, is the absolutely massive
> compile-time cost. Compiling the above example with g++ 4.4.3 takes
> over 6 minutes and over 1.3GiB of memory (and this is after I've put in
> effort to optimize it). The time I could live with, but the memory
> requirements are absurd. Any lexer of real-world complexity would be
> quite impossible to compile.

Ouch, that's a serious problem. Maybe there is a more efficient way to
get to a DFA. Or maybe the world isn't ready for compile-time DFAs.

> However, it might just be possible to use it for simple regexes, rather
> than full-blown lexers. Anyone think that might be worth investigating?

If the compile times are that bad, I wouldn't be able to use it. I can't
use it as-is regardless, because static xpressive doesn't have
compile-time access to the string literals; they're still char*.

Eric Niebler
BoostPro Computing

Boost list run by bdawes at, gregod at, cpdaniel at, john at