
From: Dave Handley (dave_at_[hidden])
Date: 2004-12-28 18:00:11


Hartmut Kaiser wrote:

>Since tokens have to be copied around in a typesafe manner I'd have expected
>to have something similar to the boost::variant, but maybe I've got you
>wrong... How do you plan to achieve this? Copying through a base class
>won't work AFAICT?

In my experience, copying tokens around is a major source of slow-down in
lexers. In the last lexer I wrote, I had to abandon the token class and use a
C-style union containing ints, doubles and std::string pointers. At the time,
my company restricted how much of Boost I could use, so boost.variant wasn't
an option. Since then I've put a lot of thought into this problem and have
come up with one solution. We haven't yet completed integration testing of
this technique with Spirit and other libraries (and I can foresee some issues
already :-( ).
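
For illustration, the sort of union token I mean is roughly this (the names
here are illustrative, not the actual code I used):

#include <string>

// A plain C-style tagged union: copying a token is a straight bitwise copy,
// with no virtual calls and no heap traffic.
struct raw_token
{
    enum kind_t { k_int, k_double, k_string } kind;

    union
    {
        int                int_value;
        double             double_value;
        const std::string* string_value;  // the string itself is owned elsewhere
    } value;
};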

The solution uses flyweighted tokens. The actual token lives in the rule;
during parsing, if the token has an assigned value, that value is copied into
the flyweight, and the token is then passed around by reference/pointer. When
you come to use the token in the parser you have a number of options:

1) Use it and then throw it away as soon as you call the ++ operator on
the token_iterator.
2) Get the data out and do something with it - for example put it in a
list/deque, etc.
3) Copy the token and use the copied version. For this final option, the
token supports a virtual clone function to create copies of the flyweighted
version for use elsewhere.

An implementation point here is that the iterators contain a pointer to the
token so that the iterator can be copied, or have the equals operator
called, etc.
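
To give a feel for the shape of this, here is a rough sketch of the kind of
interface I have in mind - the names are illustrative only, not the real code:

#include <string>

// The abstract token: the flyweight instance lives in the rule, and clone()
// produces a heap copy for callers that want to keep a token beyond the next
// increment of the iterator.
class token
{
public:
    virtual ~token() {}
    virtual token* clone() const = 0;
};

class string_token : public token
{
public:
    void assign(const std::string& v) { value_ = v; }    // the lexer fills this in
    const std::string& value() const  { return value_; }
    virtual string_token* clone() const { return new string_token(*this); }
private:
    std::string value_;
};

// The iterator only holds a pointer to the flyweight, so copying or assigning
// an iterator never copies a token.
class token_iterator
{
public:
    explicit token_iterator(token* t = 0) : current_(t) {}

    token& operator*() const { return *current_; }
    token_iterator& operator++() { /* run the DFA and refill the flyweight */ return *this; }
    bool operator==(const token_iterator& rhs) const { return current_ == rhs.current_; }
    bool operator!=(const token_iterator& rhs) const { return current_ != rhs.current_; }

private:
    token* current_;   // points at the flyweight owned by the rule
};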

There are some other potential solutions. However, any other solution leaves
the problem of how to allocate and deallocate tokens on the heap very quickly.
I considered using Alexandrescu's small-object allocator from the Loki
library, and have also briefly looked at boost.pool, but I am not comfortable
with either of these solutions, because I am convinced that they would add
too much of a burden to token creation. In my last lexer I found a speed
improvement of one to two orders of magnitude when I removed the token class
and replaced it with a union.
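
For comparison, the pool-based route would look something along these lines
(again just a sketch, with a made-up token type), which is exactly the
per-token construct/destroy cost I want to avoid:

#include <boost/pool/object_pool.hpp>
#include <string>

struct pooled_token            // hypothetical token type
{
    std::string value;
};

int main()
{
    boost::object_pool<pooled_token> pool;

    pooled_token* t = pool.construct();   // allocate a token from the pool
    t->value = "identifier";

    pool.destroy(t);                      // hand the block back to the pool
    return 0;
}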

Since our first priority on this project is speed, I do not want the
framework to slow down the DFA at all. The critical path in execution
should definitely be the DFA.

Dave Handley

