
From: Dave Handley (dave_at_[hidden])
Date: 2004-12-27 19:56:59


Hartmut Kaiser wrote:

>I'd be willing to write the interfacing stub to plug your library into
>Wave.

Thanks - I think this should be relatively easy, because we currently use a
forward iterator across the tokens in order to generate suitable output for
Spirit. One of the requirements for the Boost community to accept this
library will almost certainly be meeting the Standard requirements for
forward iterators - and I note that this is one of your key requirements
in Wave.
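
For reference, here is a minimal sketch of what those requirements imply for
a token iterator; the names are illustrative only, not the actual lextl
interface:

#include <cstddef>
#include <iterator>

class token;  // the library's token type, whatever its final form

// Illustrative only - not the real lextl iterator.
class token_iterator
{
public:
    // Typedefs the Standard requires of a conforming forward iterator:
    typedef std::forward_iterator_tag iterator_category;
    typedef token                     value_type;
    typedef std::ptrdiff_t            difference_type;
    typedef const token*              pointer;
    typedef const token&              reference;

    token_iterator();                  // must be default-constructible

    reference operator*() const;
    pointer   operator->() const;
    token_iterator& operator++();      // pre-increment
    token_iterator  operator++( int ); // post-increment

    bool operator==( const token_iterator& other ) const;
    bool operator!=( const token_iterator& other ) const;

    // Forward iterators must also be multi-pass: a copy remains valid
    // and yields the same sequence after the original is advanced,
    // which is what lets Spirit backtrack over the token stream.
};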

>The two different lexers I was using in the Wave library were a re2c based
>lexer (static switch based lexer, extremely fast and compact) and a SLEX
>based lexer (runtime generated DFA). I haven't done serious speed
>measurements, but the numbers I've got so far showed similar timings for
>both with a slight advantage for re2c (as expected). I'd expect your
>library to be very similar in speed as well. But just out of curiosity
>I'm very interested in seeing the static DFA generation version :-)

Can these lexers be effectively used with any Spirit grammar? I'll download
Wave over the next few days and have a look at how much overlap there is
between our library and the lexers within Wave.

My plan for the static DFA version (which we haven't yet discussed fully in
our internal design reviews, so it may change significantly) is to use the
memento pattern to capture the internal state of the DFA, along with the
production rules. These could then be saved and loaded, or serialised to a
C++ file, in a rather similar way to flex. This would mean that the lextl
classes would be both an inline and an offline tool. In principle this
gives some very desirable advantages (a speculative sketch follows the list
below):

1) The API remains the same in both compile-time and run-time versions.
2) You can very easily swap between run-time and compile-time versions -
for example, during development you might use run-time creation to speed up
work on a complex grammar, then switch to compile-time for your production
release.
3) You have a number of options for compile-time versions: the grammar
can be in a configuration file, or compiled into the program directly.
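
To make the memento idea concrete - and stressing again that none of this is
final, so memento, save(), write() and write_cpp() are invented names rather
than real API - usage might look something like:

#include <fstream>

void build_and_save( const syntax<>& mySyntax )
{
    lexer<> myLexer( mySyntax );        // run-time DFA construction

    // Capture the compiled DFA state plus the production rules...
    lexer<>::memento state = myLexer.save();

    // ...then either persist it for fast reloading on later runs...
    std::ofstream bin( "vrml.dfa", std::ios::binary );
    state.write( bin );

    // ...or emit it as a C++ source file, much as flex does, so the
    // production build can compile the tables in directly.
    std::ofstream src( "vrml_dfa.cpp" );
    state.write_cpp( src, "vrml_tables" );
}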

>at least you could have used the symbol parser ... <snip>

I've changed the Spirit grammar to use the symbol parser, and dropped the
time from 40-50 seconds to about 6-7 seconds. Thanks for the tip.
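
For anyone following along, the change amounts to replacing a chain of
string alternatives with Spirit classic's symbols<> parser, which matches
keywords through a single symbol-table lookup instead of trying each
alternative in turn. A minimal sketch (the keyword list here is
illustrative, not our actual grammar):

#include <boost/spirit.hpp>  // Spirit classic, as shipped circa 2004

using namespace boost::spirit;

// Before: keyword = str_p("Group") | str_p("Separator") | str_p("Switch") | ...;
// After: one symbol table, consulted in a single pass over the input.
symbols<> vrml_keywords;

void init_keywords()
{
    vrml_keywords = "Group", "Separator", "Switch";  // ...and the rest of VRML
}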

>Is there any documentation available?

No, I'm sorry, we don't have any documentation at present, because we are
still only a little beyond the prototyping stage. We still have to refine
the API a little and complete some optimisation work before we publish an
early Alpha. As I said in my original post, I'm trying to gauge interest at
the moment, and I'm pleased to see that we are generating a little interest.

To give a flavour, code against the current API would look something like
this:

syntax<> mySyntax;

// Keyword tokens - one rule per VRML node name.
mySyntax.add_rule( rule<token<1> >( "Group" ) );
mySyntax.add_rule( rule<token<2> >( "Separator" ) );
mySyntax.add_rule( rule<token<3> >( "Switch" ) );
// Lots more symbols for the rest of VRML...

// Value tokens - these also convert the matched text to a float or string.
mySyntax.add_rule( rule<token_float<69> >(                  // float literal
    "([-+]?((([0-9]+[.]?)|([0-9]*[.][0-9]+))([eE][-+]?[0-9]+)?))" ) );
mySyntax.add_rule( rule<token_string<70> >( "#[^\\n\\r]*[\\n\\r]" ) );      // comment
mySyntax.add_rule( rule<token_string<71> >( "\"[^\"\\n\\r]*[\"\\n\\r]" ) ); // quoted string
mySyntax.add_rule( rule<token_string<72> >( "[a-zA-Z_][a-zA-Z0-9_]*" ) );   // identifier
mySyntax.add_rule( rule<token<73> >( "[ \\t\\n\\r]+" ) );                   // whitespace

lexer<> myLexer( mySyntax );
std::ifstream fsp( "test.vrml" );
myLexer.set_source( fsp ); // It will also have a char iterator interface.

for ( lexer<>::token_iterator iter = myLexer.begin();
      iter != myLexer.end(); ++iter )
{
    // Do something with the tokens...
}

Essentially you create a syntax to which you add rules. Each rule specifies
the token it will create and a regular expression to be matched. Special
versions of tokens exist that will automatically generate a float, int or
string from the matched expression. You then construct a lexer from the
syntax, set a suitable source input, and iterate over the tokens.

The tokens are polymorphic, so you can easily access the type and data; they
are also statically typed, so you can use types to match the tokens, or you
can dispatch tokens to functions through a visitor framework. You can use
dynamically typed tokens as well if you want, although you then lose much of
the benefit of strong typing.

I especially like the idea of using visitors to distinguish between tokens.
In my example of a VRML file, the long float lists can be handled with a
visitor that contains visit functions for a float token, a comma token (note
that in VRML vector lists the elements of a 3D vector are separated by
whitespace, while the 3D vectors themselves are separated by commas) and a
list-termination token (e.g. a "]"). The visitor would then automatically
store the floats in a suitable data structure - like a list of 3D vectors.
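
As a sketch of how such a visitor might look - the token_visitor base, the
visit names, the value() accessor and the token IDs for "," and "]" are all
assumptions on my part, since only the rules above have been defined:

#include <list>
#include <vector>

// Stand-in declarations for the unpublished lextl types used below;
// the real library will define these differently.
template <int Id> struct token { };
template <int Id> struct token_float : token<Id> { float value() const; };
struct token_visitor { virtual ~token_visitor() { } };

struct vec3 { float x, y, z; };

// Collects "x y z, x y z, ... ]" into a list of 3D vectors.
struct vec3_list_builder : token_visitor
{
    std::vector<float> pending;  // components of the vector being built
    std::list<vec3>    vectors;  // completed 3D vectors

    void visit( const token_float<69>& t )  // the float rule shown above
    {
        pending.push_back( t.value() );
        if ( pending.size() == 3 )
        {
            vec3 v = { pending[0], pending[1], pending[2] };
            vectors.push_back( v );
            pending.clear();
        }
    }

    void visit( const token<80>& ) { }  // hypothetical "," token: ignore
    void visit( const token<81>& ) { }  // hypothetical "]" token: list complete
};

The iteration loop above would then drive it with something along the lines
of iter->accept( builder ) for each token.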

Dave

