
From: Chris Lattner (clattner_at_[hidden])
Date: 2007-09-01 21:35:35


On Sep 1, 2007, at 2:14 PM, David Abrahams wrote:
> on Sat Sep 01 2007, Chris Lattner <clattner-AT-apple.com> wrote:
>
>> Can you give me a specific example of where making the entire lexer a
>> template would be more useful than adding a couple of virtual method
>> calls? It isn't meaningful to lex anything other than a sequence of
>> char's
>
> You're serious? It's meaningless to lex UTF-32?

It's not meaningless, but templatizing the entire lexer (which
includes the preprocessor) is not necessarily the best way to achieve
this. Consider three possible design points:

1. Templatize the entire lexer and preprocessor on the input encoding.
2. Translate UTF-32 (and every other encoding) to UTF-8 in a prepass
over the buffer. Lex the resulting buffer as UTF-8.
3. Make the low level "getchar" routine in the lexer do a dynamic
switch on the input encoding type, translating the input encoding
into the internal encoding on the fly.

These three design points all have strengths and weaknesses. For
example:

#1 is the best solution if your compiler statically cares about only a
single encoding - the single instantiation is obviously going to be
performance optimal for that encoding, and the code size/compile time
will be comparable to a non-templated implementation. However, if it
has to handle multiple encodings (as is typical for most real
compilers), it requires instantiating a large amount of code once per
encoding type. This is a serious compile time (of the lexer) and code
size problem.
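To make the tradeoff concrete, here is a rough sketch of #1 (not
clang's code; the EncodingTraits names are made up for illustration).
Every encoding you support stamps out a full copy of all the lexing
routines:

  #include <cstdint>

  struct UTF8Traits {
    // Sketch only: handles just the ASCII subset of UTF-8.
    static uint32_t decode(const char *&Ptr) {
      return static_cast<unsigned char>(*Ptr++);
    }
  };

  struct UTF32LETraits {
    static uint32_t decode(const char *&Ptr) {
      const unsigned char *P =
          reinterpret_cast<const unsigned char *>(Ptr);
      uint32_t CP = P[0] | (P[1] << 8) | (P[2] << 16) |
                    (uint32_t(P[3]) << 24);
      Ptr += 4;
      return CP;
    }
  };

  template <typename EncodingTraits>
  class Lexer {
    const char *BufferPtr;
  public:
    explicit Lexer(const char *Buf) : BufferPtr(Buf) {}

    // Every token-lexing routine like this one is duplicated per
    // instantiation of the template.
    void lexIdentifier() {
      uint32_t C = EncodingTraits::decode(BufferPtr);
      while (isIdentifierBody(C))
        C = EncodingTraits::decode(BufferPtr);
    }

    static bool isIdentifierBody(uint32_t C) {
      return (C >= 'a' && C <= 'z') || (C >= 'A' && C <= 'Z') ||
             (C >= '0' && C <= '9') || C == '_';
    }
  };

  // Lexer<UTF8Traits> and Lexer<UTF32LETraits> are two separate
  // copies of the generated lexing code.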

#2 is the best solution if your compiler almost always reads ASCII or
UTF-8 but needs to be compatible with a wide range of encodings. It
features low compile time (of the lexer) and is optimal for UTF-8 and
compatible encodings. The downside is that incompatible encodings
require an expensive prepass over the input, which can be a
significant performance problem.
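A minimal sketch of #2, assuming the input is already valid UTF-32
code points (no validation shown): one pass converts the whole buffer
to UTF-8, and the ordinary UTF-8 lexer then runs over the result.

  #include <cstdint>
  #include <string>
  #include <vector>

  std::string convertUTF32ToUTF8(const std::vector<uint32_t> &In) {
    std::string Out;
    Out.reserve(In.size());   // lower bound; exact for pure ASCII
    for (uint32_t CP : In) {
      if (CP < 0x80) {
        Out += static_cast<char>(CP);
      } else if (CP < 0x800) {
        Out += static_cast<char>(0xC0 | (CP >> 6));
        Out += static_cast<char>(0x80 | (CP & 0x3F));
      } else if (CP < 0x10000) {
        Out += static_cast<char>(0xE0 | (CP >> 12));
        Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
        Out += static_cast<char>(0x80 | (CP & 0x3F));
      } else {
        Out += static_cast<char>(0xF0 | (CP >> 18));
        Out += static_cast<char>(0x80 | ((CP >> 12) & 0x3F));
        Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
        Out += static_cast<char>(0x80 | (CP & 0x3F));
      }
    }
    return Out;   // hand this buffer to the normal UTF-8 lexer
  }

The prepass is the expensive part: it touches every byte of the input
once before lexing even begins, but only for encodings that aren't
already UTF-8-compatible.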

#3 is the best solution for a compiler that reads a wide variety of
encodings with no single common case. It features low compile time,
but obviously has worse performance than #1. One interesting aspect
of it (which is often under-appreciated) is that common processors
will perfectly predict the encoding-related branches in the inner
scanning loop, so the performance impact is much lower than might be
expected.
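A minimal sketch of #3 (again, not clang's code): the low-level
"read one character" routine switches on an encoding field that is
fixed for the lifetime of the buffer, so the branch is trivially
predictable.

  #include <cstdint>

  enum class Encoding { ASCII, UTF8, UTF32LE };

  struct InputState {
    const unsigned char *Ptr;
    Encoding Enc;   // set once when the buffer is opened
  };

  uint32_t getCharAndAdvance(InputState &In) {
    switch (In.Enc) {          // perfectly predicted in the inner loop
    case Encoding::ASCII:
      return *In.Ptr++;
    case Encoding::UTF32LE: {
      uint32_t CP = In.Ptr[0] | (In.Ptr[1] << 8) |
                    (In.Ptr[2] << 16) | (uint32_t(In.Ptr[3]) << 24);
      In.Ptr += 4;
      return CP;
    }
    case Encoding::UTF8:
      // Multi-byte UTF-8 decoding elided from this sketch.
      return *In.Ptr++;
    }
    return 0;   // unreachable
  }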

My point isn't to tell you what I think is the optimal solution - I'm
just trying to point out that a very important part of software
engineering is choosing the right design point for a problem. I love
C++ as a language because it gives me many tools to choose from to
solve a specific problem such as this.

Compiler design in particular doesn't have any "magic": it's all just
a series of design decisions based on various engineering tradeoffs.
Being aware of and carefully considering all of the options is the
best way to make a high-quality product IMO.

>> for example, and efficient buffer management (at least in our
>> context) means that the input to the lexer isn't useful as an
>> iterator interface.
>
> Well, the kind of input sequence is exactly one thing I would
> templatize.

To what benefit? In practice, clang requires its input to come from
a nul terminated memory buffer (yes, we do correctly handle embedded
nul's in the input buffer as whitespace). Here are the pros and cons:

Pros: clang is designed for what we perceive to be the common case.
In particular, mmap'ing in files almost always implicitly nul
terminates the buffer (if a file is not an exact multiple of the page
size, most major OS's zero-fill to the end of the page), so we get
this invariant for free in most cases. Memory buffers and many
others are also easy to handle in this scheme.

Further, knowing that we have a sequential memory buffer as an input
makes various optimizations really trivial: for example, our block
comment skipper is vectorized on hosts that support SSE or Altivec.
Having the nul terminator at the end of the file means that the lexer
doesn't have to check for the "end of buffer" condition in *many*
highly performance-sensitive lexing loops (e.g. lexing identifiers,
which cannot contain a nul).
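As a rough illustration (not the actual clang code), an identifier
loop over a nul-terminated buffer needs only one test per character,
because a nul can never be part of an identifier and therefore ends
the scan by itself:

  static bool isIdentifierBody(unsigned char C) {
    return (C >= 'a' && C <= 'z') || (C >= 'A' && C <= 'Z') ||
           (C >= '0' && C <= '9') || C == '_';
  }

  const char *lexIdentifier(const char *CurPtr) {
    // No separate "end of buffer" test: '\0' fails isIdentifierBody
    // and terminates the loop on its own.
    while (isIdentifierBody(*CurPtr))
      ++CurPtr;
    return CurPtr;   // one past the end of the identifier
  }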

Cons: for files that are exactly a multiple of the system page size,
you sometimes have to do a dynamic allocation and copy the buffer to
nul terminate the input. Likewise, if you want to support arbitrary
input iterators, then you have to copy the input range into a memory
buffer before you start.
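A minimal sketch of that setup tradeoff (POSIX-only, error handling
and cleanup omitted, and not clang's actual buffer code): mmap the
file and rely on the kernel's zero-filled tail page for the
terminator, falling back to a heap copy with an explicit nul only
when the size is an exact multiple of the page size.

  #include <cstddef>
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  const char *mapSourceFile(const char *Path, size_t &Size) {
    int FD = open(Path, O_RDONLY);
    struct stat ST;
    fstat(FD, &ST);
    Size = ST.st_size;
    long PageSize = sysconf(_SC_PAGESIZE);

    if (Size % PageSize != 0) {
      // Common case: the zero-filled remainder of the last page gives
      // us the nul terminator for free.
      return static_cast<const char *>(
          mmap(nullptr, Size, PROT_READ, MAP_PRIVATE, FD, 0));
    }

    // Rare case: copy into a heap buffer and append the nul ourselves.
    char *Buf = new char[Size + 1];
    read(FD, Buf, Size);
    Buf[Size] = '\0';
    return Buf;
  }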

Personally, I can't think of a situation where lexing C code from an
arbitrary input iterator range would be useful. I'll assume you can,
however: in this case, copying the input into a buffer before you
start seems quite likely to give *better* performance than a fully
parameterized lexer would. By restricting the lexer to assuming only
input iterators, you force it to check for end of stream after it
reads *every* character, you effectively prevent vectorization and
other optimizations, and you are more at the mercy of the compiler
optimizer to produce good code.
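For contrast, here is the same identifier loop written against an
arbitrary iterator range (reusing the isIdentifierBody helper from
the earlier sketch): every character read now also costs an
end-of-stream check, and the lack of a guaranteed contiguous buffer
rules out the vectorized fast paths mentioned above.

  bool isIdentifierBody(unsigned char C);   // as in the earlier sketch

  template <typename InputIt>
  InputIt lexIdentifier(InputIt It, InputIt End) {
    while (It != End && isIdentifierBody(*It))  // two tests per char
      ++It;
    return It;
  }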

-Chris

