Boost logo

Boost :

From: John Max Skaller (skaller_at_[hidden])
Date: 2001-06-10 09:23:16

Daniel Berlin wrote:

> > It generates _ocaml_ code, not C.

> Yeah, but it appears it's still table driven, no?

        Yes, it is table driven.
> If you hand code it in ocaml, is it smaller?

        Haven't tried.
> Making re2c output ocaml code wouldn't be all *that* difficult.


> > I doubt it. A table driven lexer can process
> > text with one random and one sequential memory access.
> Errr, except compilers can't optimize this well at all,

        You're probably right :-)

> Except you forgot the memory load and store latency from reading this
> table, which is so much more than the instructions in the engine it's
> not funny.
        There is no storing. There is a single table read,
which is effectively random, and reading the character,
which should be in the cache. So the performance bound
for the fastest possible lexer is a single random memory access.

> Why exactly do you think hand coded lexers are often faster?

        Because often, you can elide the table lookup,
and do _everything_ in registers/using constants in the code.
For example, to lex an identifier, you can use range comparisons for

        a-z, A-Z, 0-9

> Because they aren't table driven, and avoid the extra memory accesses
> inside what should be a tight loop.

        Correct. However, while it is possible to do this
by hand, most mechanically generated lexers building code
would be unlikely to spot this kind of 'smart' encoding.
> > As David Abrahams said, this is the bottom line:
> > there are a lots of characters.
> Sure. And having to do 3 memory accesses for each one instead of 1
> memory access (just the next character) kills you when you do it a
> million times.

        I think this is the wrong way to count:
I see one random read, and one sequential one.
The random read dominates, since it is likely to
be a cache miss every time.

        For an 'executable' lexer, it is possible to
reduce this time to one sequential read, which is almost
certainly going to be a cache hit: in this case,
such a lexer would fly, since it could use the cache
very effectively.

        So I don't disagree with you, I just have doubts :-)

John (Max) Skaller, mailto:skaller_at_[hidden]
10/1 Toxteth Rd Glebe NSW 2037 Australia voice: 61-2-9660-0850
checkout Vyper
download Interscript

Boost list run by bdawes at, gregod at, cpdaniel at, john at