
From: Hartmut Kaiser (hartmutkaiser_at_[hidden])
Date: 2004-12-28 04:38:45


 
Dave Handley wrote:

> > I'd be willing to write the interfacing stub to plug your
> > library into Wave.
>
> Thanks - I think this should be relatively easy because we
> currently use a forward iterator across the tokens in order to
> generate suitable output for Spirit. Clearly one of the
> requirements for the Boost community to accept this library in
> general will probably be meeting the Standard requirements for
> forward iterators - and I note that this is one of your key
> requirements in Wave.

The lexer in Wave feeds a Spirit grammar (which recognises the preprocessor
statements), thus a forward iterator is sufficient.

> > The two different lexers I was using in the Wave library were a
> > re2c based lexer (static switch based lexer, extremely fast and
> > compact) and a SLEX based lexer (runtime generated DFA). I
> > haven't done serious speed measurements, but the numbers I've
> > got so far showed similar timings for both with a slight
> > advantage for re2c (as expected). I'd expect your library to be
> > very similar in speed as well. But just out of curiosity I'm
> > very interested in seeing the static DFA generation version :-)
>
> Can these lexers be effectively used with any Spirit grammar?
> I'll download Wave over the next few days and have a look at how
> much overlap there is between our library and the lexers within
> Wave.

Both essentially expose a functor-based (lex-like) interface which, when
called, returns the next token. I was using the multi_pass iterator to wrap
these into a usable iterator interface. So yes, both are usable with Spirit
grammars out of the box.
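
For illustration, here is a minimal sketch of that wrapping using classic
Spirit's multi_pass with the functor_input policy. The token functor and the
token ids are made up for the example; only the multi_pass usage itself
reflects what I described above:

    #include <iostream>
    #include <boost/spirit/iterator/multi_pass.hpp>

    // Hypothetical lex-like functor: every call hands back the next token.
    // Here it just walks a fixed array of token ids to keep the sketch
    // self-contained; a real lexer would return its token objects instead.
    struct token_functor
    {
        typedef int result_type;       // type of the "tokens" produced
        static result_type eof;        // sentinel marking end of input

        token_functor() : pos(0) {}

        result_type operator()()
        {
            static const int ids[] = { 1, 2, 3 };
            return pos < 3 ? ids[pos++] : eof;
        }

        int pos;
    };
    token_functor::result_type token_functor::eof = -1;

    // multi_pass with the functor_input policy turns the functor into an
    // iterator satisfying the forward iterator requirements, so it can be
    // fed straight into a Spirit grammar.
    typedef boost::spirit::multi_pass<
            token_functor,
            boost::spirit::multi_pass_policies::functor_input,
            boost::spirit::multi_pass_policies::first_owner,
            boost::spirit::multi_pass_policies::no_check,
            boost::spirit::multi_pass_policies::std_deque
        > token_iterator;

    int main()
    {
        token_iterator first = token_iterator(token_functor());
        token_iterator last;                 // default constructed == end
        for (/**/; first != last; ++first)
            std::cout << *first << '\n';     // prints the token ids 1 2 3
    }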

> My plan for the static DFA version (which we haven't fully
> discussed in our internal design discussions yet, so it may change
> significantly) is to use the memento pattern to reflect the
> internal state of the DFA, along with the production rules. These
> could then be saved or loaded, or serialised to a C++ file, in a
> rather similar way to flex.

The SLEX lexer's DFA tables can be saved to a file as well. But since SLEX
doesn't support production rules (it's a recognition engine only), obviously
no production rules can be saved ;-)

> This would mean that the lextl classes would be both an inline
> and an offline tool. In principle this gives some very desirable
> advantages:
>
> 1) The API remains the same in both compile-time and run-time
>    versions.
> 2) You can very easily swap between run-time and compile-time
>    versions - for example during development you may use run-time
>    creation to speed up the process of developing a complex
>    grammar, then switch to compile-time for your production
>    release.
> 3) You have a number of options for compile-time versions. The
>    grammar can be in a configuration file, or compiled into the
>    program directly.

Agreed!
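
Just to make sure we mean the same thing, a purely hypothetical sketch of
that swap - the class names here are invented, the point is only that both
backends share one interface:

    // Hypothetical names only - lextl's real classes may differ. Both
    // backends expose the same interface, so switching between the
    // run-time built DFA and the offline generated one is a single typedef.
    struct runtime_lexer { /* builds its DFA from the rules at startup */ };
    struct static_lexer  { /* uses DFA tables generated offline into C++ */ };

    #ifdef USE_RUNTIME_LEXER
    typedef runtime_lexer lexer_type;  // handy while developing a complex grammar
    #else
    typedef static_lexer  lexer_type;  // production build, no start-up cost
    #endif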

> As a flavour, the current API would result in some code looking like:
>
> syntax<> mySyntax;
> mySyntax.add_rule( rule<token<1> >( "Group" ) );
> mySyntax.add_rule( rule<token<2> >( "Separator" ) );
> mySyntax.add_rule( rule<token<3> >( "Switch" ) );
> // Lots more symbols for the rest of VRML...
> mySyntax.add_rule( rule<token_float<69> >(
>     "([-+]?((([0-9]+[.]?)|([0-9]*[.][0-9]+))([eE][-+]?[0-9]+)?))" ) );
> mySyntax.add_rule( rule<token_string<70> >( "#[^\\n\\r]*[\\n\\r]" ) );
> mySyntax.add_rule( rule<token_string<71> >( "\"[^\"\\n\\r]*[\"\\n\\r]" ) );
> mySyntax.add_rule( rule<token_string<72> >( "[a-zA-Z_][a-zA-Z0-9_]*" ) );
> mySyntax.add_rule( rule<token<73> >( "[ \\t\\n\\r]+" ) );
>
> lexer<> myLexer( mySyntax );
> std::ifstream fsp( "test.vrml" );
> myLexer.set_source( fsp ); // Also it will have a char iterator interface
>
> lexer<>::token_iterator iter = myLexer.begin();
> for ( ; iter != myLexer.end(); ++iter )
>     // Do something with the tokens...
>     ;

How do you specify production rules? Your sample above shows the recognition
part only ;-)
The numbers are the token IDs, I assume?

> Essentially you create a syntax to which you add rules. These
> rules explain the token that will be created by the rule, and a
> regular expression to be matched. Special versions of tokens
> exist that will automatically generate a float, int or string
> from the matched expression. You then construct a lexer off the
> syntax, set a suitable source input, and then iterate over the
> tokens. The tokens are polymorphic so you can easily access the
> type and data, they are statically typed so you can use types to
> match the tokens, or you can dispatch tokens to functions through
> a visitor framework.

Is your token type predefined, or is it possible to use my own token types
instead?
I'm asking because Wave in principle allows any token type to be used, but it
requires the tokens to expose a certain (very small) interface.
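
For reference, the kind of interface I mean is roughly the following. The
member and type names here are placeholders only; the exact requirements are
spelled out in the Wave documentation:

    #include <string>

    // Placeholder position type; Wave uses its own file position class.
    struct file_position
    {
        std::string file;
        unsigned    line, column;
    };

    // Sketch of a minimal token: an id, the matched text and its position.
    class my_token
    {
    public:
        my_token(int id, std::string const& value, file_position const& pos)
          : id_(id), value_(value), pos_(pos) {}

        int                  get_id() const       { return id_; }
        std::string const&   get_value() const    { return value_; }
        file_position const& get_position() const { return pos_; }

    private:
        int           id_;
        std::string   value_;
        file_position pos_;
    };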

> You can also use dynamically typed tokens as well if you want -
> although you lose much of the benefit of strong typing. I
> especially like the idea of using visitors to distinguish between
> tokens. This means that in my example of a VRML file, the long
> float lists can be handled with a visitor that contains visit
> functions for a float token, a comma token (note that in vector
> lists in VRML the elements in a 3D vector are separated by
> whitespace, and the 3D vectors themselves separated by commas)
> and a list termination token (eg a "]"). The visitor would then
> automatically store the floats in a suitable data structure -
> like a list of 3D vectors.

Sounds good!
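
If I understand the visitor idea correctly, it boils down to something like
the following sketch (all type names invented here, just to restate the idea
in code):

    #include <vector>

    // Made-up token types, standing in for the statically typed tokens
    // described above.
    struct float_token    { double value; };
    struct comma_token    {};
    struct list_end_token {};   // e.g. a "]"

    struct vec3 { double x, y, z; };

    // Visitor that turns "x y z, x y z, ... ]" into a list of 3D vectors.
    class vector_list_visitor
    {
    public:
        void visit(float_token const& t)   { current_.push_back(t.value); }
        void visit(comma_token const&)     { flush(); }  // one vector finished
        void visit(list_end_token const&)  { flush(); }  // end of the list

        std::vector<vec3> const& result() const { return vectors_; }

    private:
        void flush()
        {
            if (current_.size() == 3)
            {
                vec3 v = { current_[0], current_[1], current_[2] };
                vectors_.push_back(v);
            }
            current_.clear();
        }

        std::vector<double> current_;
        std::vector<vec3>   vectors_;
    };
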
Regards Hartmut

