
Subject: Re: [boost] [regex, xpressive] interesting(?) perf benchmark
From: John Bytheway (jbytheway+boost_at_[hidden])
Date: 2010-06-08 04:22:23

On 08/06/10 01:34, Eric Niebler wrote:
> On 6/7/2010 8:01 PM, Joel de Guzman wrote:
>> On 6/8/10 12:10 AM, Eric Niebler wrote:
>>> An interesting research project would be to investigate whether
>>> xpressive's hybrid static/dynamic NFA design can be applied to DFAs
>>> as well, with similar perf wins. This would be a non-trivial
>>> undertaking, like completely reimplementing xpressive.
>> Just to remind you, there's a very good DFA implementation
>> underneath Spirit Lex. The interface is not perfect (Ben has
>> requested some input in that area), but the implementation is
>> very good. Spirit Lex is proto on top of lexertl (Ben's work). We
>> find it to be one of the fastest lexers, close in performance to the
>> fastest (RE2C), which is a code-gen that generates a humongous
>> switch.
> I haven't forgotten. IIRC, lexertl is like dynamic xpressive: it accepts
> strings as input and builds an FSA. I was speculating about what the
> lexertl equivalent of static xpressive would look like and whether it
> would be even faster. Then static xpressive could use static lexertl for
> simple regexes, and dynamic xpressive could use the dynamic one.

I once wrote a compile-time lexer compiler, which seems to be the sort
of thing you're suggesting. It worked like this:

  typedef lexer<
    rule<literal<'k','e','y','w','o','r','d'>, construct<KEYWORD> >,
    rule<one_or_more<range<'a','z'> >, construct<ID, mpl::true_> >,
    rule<alternative_literal<' ','\n','\t','\r'>, discard >
  > keyword_lexer;
  typedef keyword_lexer::token_type token;
  typedef keyword_lexer::token_id_map id_map;

  BOOST_MPL_ASSERT_RELATION((mpl::at<id_map, KEYWORD>::type::value),==,0);
  BOOST_MPL_ASSERT_RELATION((mpl::at<id_map, ID>::type::value),==,1);

  string s("foo bar\t\tkeyword \n\r baz keywordo ");
  keyword_lexer l(s);
  vector<token::ptr> tokens;
  copy(l.begin(), l.end(), back_inserter(tokens));

  BOOST_REQUIRE_EQUAL(tokens.size(), 5);
  BOOST_CHECK_EQUAL(tokens[0]->id(), l.get_id<ID>());
  BOOST_CHECK_EQUAL(tokens[2]->id(), l.get_id<KEYWORD>());
  BOOST_CHECK_EQUAL(tokens[4]->id(), l.get_id<ID>());

This defines a lexer recognizing whitespace-separated lower-case words,
but treating the word 'keyword' as special. Each rule of the lexer is a
regex followed by what to do when that regex is matched. The first rule
constructs a KEYWORD token, the second constructs an ID token (and the
mpl::true_ means it passes the range which matched the regex to the
constructor), the third matches without creating a token.

The library will take this lexer specification and turn it into an NFA,
then transform that into a DFA, then encode that as a transition table
in an array, *all at compile time*. On reflection, it probably would
have been better (and faster) to use the giant-switch-statement
approach, rather than an array-based transition table.
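To make the trade-off concrete, here is a minimal hand-written DFA for the regex [a-z]+, driven once from an array-based transition table and once from a switch statement. This is only a run-time sketch of the two encodings being compared (all names here are illustrative), not what my library or a code generator like RE2C actually emits:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// DFA for [a-z]+ over a 2-symbol input classification:
// 0 = lower-case letter, 1 = anything else.
// States: 0 = start, 1 = accepting, 2 = dead.
static const int table[3][2] = {
    /* state 0 */ {1, 2},
    /* state 1 */ {1, 2},
    /* state 2 */ {2, 2},
};

inline int classify(char c) { return (c >= 'a' && c <= 'z') ? 0 : 1; }

// Array-based driver: one indexed load per input character.
bool match_table(const std::string& s) {
    int state = 0;
    for (std::size_t i = 0; i != s.size(); ++i)
        state = table[state][classify(s[i])];
    return state == 1;
}

// Switch-based driver: the same transitions baked into branches,
// which compilers can often turn into straight-line jumps.
bool match_switch(const std::string& s) {
    int state = 0;
    for (std::size_t i = 0; i != s.size(); ++i) {
        switch (state) {
        case 0:
        case 1: state = (classify(s[i]) == 0) ? 1 : 2; break;
        default: state = 2; break;
        }
    }
    return state == 1;
}
```

Both drivers compute the same function; the difference is whether the transitions live in data (the array) or in code (the switch), and the latter is what the giant-switch generators exploit.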

Then we give the lexer a string to parse (any forward range of
characters would do) and we can access the parsed tokens (in a
type-erased fashion) by treating the lexer itself as a range. Obviously
going to all this trouble to do everything at compile time and then
offering only a type-erased interface is a bit silly, but there are
other ways (you can give the lexer a UnaryFunction which it calls on the
next token).
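To sketch what that UnaryFunction style looks like, here is a toy whitespace tokenizer that calls a caller-supplied function object on each token as it is recognized, instead of exposing tokens through a type-erased range. The names (for_each_token, collect_tokens) are illustrative only, not my library's actual interface:

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// One possible UnaryFunction: collects each token into a vector.
// Any function object with a matching operator() would do.
struct collect_tokens {
    std::vector<std::string>* out;
    void operator()(const std::string& tok) const { out->push_back(tok); }
};

// Toy whitespace tokenizer: calls f on every maximal run of
// non-whitespace characters in the input.
template <typename UnaryFunction>
void for_each_token(const std::string& input, UnaryFunction f) {
    std::string::size_type i = 0;
    while (i < input.size()) {
        // skip whitespace
        while (i < input.size() && std::isspace((unsigned char)input[i]))
            ++i;
        std::string::size_type start = i;
        // consume one token
        while (i < input.size() && !std::isspace((unsigned char)input[i]))
            ++i;
        if (start != i)
            f(input.substr(start, i - start));
    }
}
```

The advantage over the range interface is that no type erasure is needed: the token's static type can flow straight into the function object's operator().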

The downside of this approach, of course, is the absolutely massive
compile-time cost. Compiling the above example with g++ 4.4.3 takes
over 6 minutes and over 1.3GiB of memory (and this is after I've put in
effort to optimize it). The time I could live with, but the memory
requirements are absurd. Any lexer of real-world complexity would be
quite impossible to compile.

However, it might just be possible to use it for simple regexes, rather
than full-blown lexers. Anyone think that might be worth investigating?

John Bytheway
