From: Hartmut Kaiser (hartmut.kaiser_at_[hidden])
Date: 2005-08-20 16:45:36


Daryle Walker wrote:

Sorry for the late answer, your questions required some investigation to be
answered correctly.

> Wave is our C++ preprocessor, but preprocessing is the third
> phase of translating a file. (Looking at section 2.1 in the
> standard). I have a gut feeling that all the compilers out
> there mush the first three phases together in parsing a file.
> Glancing over the Wave docs gives me the same impression
> about it. Are either one of these feelings accurate (this
> requires a separate answer for each parser)? If the answer
> for Wave is "yes", could we separate them, at least as an
> option? I feel that this is important so we can gain full
> understanding of each phase. It may be more complicated[1],
> and most likely slower, but it could represent a clean
> implementation. (BTW, what phases does Wave act like?)

Wave internally doesn't distinguish explicitly between the different phases
mandated by the Standard. This is probably similar to most other compilers
out there. Wave is currently built out of two separate layers (I don't
call them phases to avoid confusion with the Standard's phases): a lexing
layer and a preprocessing layer.

What you need to know when using Wave is that it does not act on
preprocessing tokens but directly on the C++ token level. This may have been
a wrong design decision from the beginning, but it allows Wave to expose
the full set of C++ tokens as defined by the Standard through its iterator
interface without having to rescan (retokenize) the generated preprocessed
output. Additionally, I wanted the lexing components to generate C++
tokens so that they are usable in other contexts where no preprocessing is
required.

The drawback of this design is that Wave
- doesn't conform to the standard in this regard
- currently doesn't fully support the handling of preprocessing numbers as
mandated by the Standard.
This is rarely a practical issue though, since many uses of preprocessing
numbers are handled correctly anyway.
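
To illustrate the preprocessing-number point: the Standard's pp-number
grammar is looser than the grammar for C++ numeric literals, so a sequence
such as 0x1e+1 forms one preprocessing number although it is not a valid
C++ literal. The sketch below is not Wave code; pp_number_length is a
hypothetical helper that simply follows the pp-number grammar.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Length of the longest preprocessing number starting at s[0], or 0 if
// s does not start with one. Grammar being followed:
//   digit | . digit | pp-number digit | pp-number nondigit
//   | pp-number e sign | pp-number E sign | pp-number .
std::size_t pp_number_length(const std::string& s)
{
    auto is_digit = [&](std::size_t k) {
        return k < s.size() && std::isdigit(static_cast<unsigned char>(s[k]));
    };

    std::size_t i = 0;
    // A pp-number starts with a digit, or with '.' followed by a digit.
    if (is_digit(0)) {
        i = 1;
    } else if (!s.empty() && s[0] == '.' && is_digit(1)) {
        i = 2;
    } else {
        return 0;
    }

    while (i < s.size()) {
        assert(i >= 1);  // s[i - 1] below is always a valid index
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c == '+' || c == '-') && (s[i - 1] == 'e' || s[i - 1] == 'E')) {
            ++i;  // a sign is only allowed directly after e or E
        } else if (std::isalnum(c) || c == '_' || c == '.') {
            ++i;  // digit, nondigit, or '.'
        } else {
            break;
        }
    }
    return i;
}
```

With this helper, pp_number_length("0x1e+1") is 6: the whole sequence is one
preprocessing number. A lexer that produces C++ tokens directly would
typically see 0x1e, + and 1 instead, which is the kind of difference meant
above.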

The first (bottom) layer in Wave generates the C++ tokens. These are
generated by a lexing component exposing them through an iterator interface.
This lexing component implements the compilation phases 1 and 2.

There are two different lexing components supporting the full set of C++
tokens, both usable separately without the preprocessing layer described
below. As I've said, you get C++ tokens at this level already. The
lexing components differ only in their implementation: they are built with
different lexer generator toolkits (re2c and slex). I have an
xpressive-based lexer here as well, but it needs some additional work.

As outlined above, phase 3 is not implemented in Wave; it generates C++
tokens instead.

The second layer in Wave is the preprocessing layer. It uses the C++ tokens
generated by the lexer to
- recognise the preprocessing directives and execute them
- recognise identifiers representing macro invocations and expand these
That corresponds to phase 4.
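
The layering can be sketched with a toy example. This is not Wave's
implementation or API: Token, lex and preprocess are hypothetical names, the
lexer only knows identifiers, numbers and single-character punctuation, and
the preprocessing layer only handles object-like #define without rescanning.
It is only meant to show one layer consuming the tokens of the other while
the source position travels with each token.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Token {
    std::string text;
    int line;  // source line, carried through both layers
};

// Lexing layer (toy): emits an explicit "\n" token so that the next layer
// can find the end of a directive.
std::vector<Token> lex(const std::string& src)
{
    std::vector<Token> out;
    int line = 1;
    for (std::size_t i = 0; i < src.size();) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (c == '\n') {
            out.push_back({"\n", line++});
            ++i;
        } else if (std::isspace(c)) {
            ++i;
        } else if (std::isalnum(c) || c == '_') {
            std::size_t j = i;
            while (j < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[j])) ||
                    src[j] == '_'))
                ++j;
            out.push_back({src.substr(i, j - i), line});
            i = j;
        } else {
            out.push_back({std::string(1, src[i]), line});
            ++i;
        }
    }
    return out;
}

// Preprocessing layer (toy): executes "#define NAME body" and replaces
// later occurrences of NAME by the body. Newline tokens are dropped.
std::vector<Token> preprocess(const std::vector<Token>& in)
{
    std::map<std::string, std::vector<Token>> macros;
    std::vector<Token> out;
    bool at_line_start = true;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const Token& t = in[i];
        if (at_line_start && t.text == "#" && i + 2 < in.size() &&
            in[i + 1].text == "define" && in[i + 2].text != "\n") {
            std::vector<Token> body;
            std::size_t j = i + 3;
            for (; j < in.size() && in[j].text != "\n"; ++j)
                body.push_back(in[j]);
            macros[in[i + 2].text] = body;
            i = j;  // the loop increment then skips the directive's newline
            continue;
        }
        if (t.text == "\n") { at_line_start = true; continue; }
        at_line_start = false;
        auto m = macros.find(t.text);
        if (m == macros.end()) { out.push_back(t); continue; }
        // Expanded tokens report the position of the macro invocation.
        for (Token r : m->second) { r.line = t.line; out.push_back(r); }
    }
    return out;
}
```

For the input "#define N 3 + 4\nint x = N;\n" the call preprocess(lex(...))
yields the seven tokens int x = 3 + 4 ; and the three expanded tokens carry
line 2, the line of the invocation.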

Phase 5 and above are not implemented in Wave.

> The first two[2] phases are:
>
> 1. Native characters that match basic source characters are
> converted as so (including line breaks). Trigraphs are
> expanded to basic source[3]. Other characters are turned
> into internal Unicode expansions (i.e. act like "\uXXXX" or
> "\Uxxxxxxxx"[4]).
> 2. The backslash-newline soft line-break combinations are
> collapsed, folding multiple native lines into one logical
> line. We should spit out an error if the folding creates
> Unicode escapes. For non-empty files, we need to spit out
> an error if the file does not end in a hard line-break: ending in
> either a non-newline character or a backslash-newline combination is
> forbidden.

This is done, except for the test for invalid Unicode characters resulting
from line collapsing. Generally, Wave is not Unicode aware. I'd like to use
a future Boost library for that.
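
As a minimal sketch of the phase 2 rules quoted above (not taken from Wave;
splice_lines and ends_properly are hypothetical helper names), line
splicing and the end-of-file check could look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Delete every backslash that is immediately followed by a newline,
// joining the physical lines into one logical line.
std::string splice_lines(const std::string& src)
{
    std::string out;
    out.reserve(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\\' && i + 1 < src.size() && src[i + 1] == '\n') {
            ++i;  // skip the newline as well
            continue;
        }
        out += src[i];
    }
    assert(out.size() <= src.size());  // splicing never adds characters
    return out;
}

// A non-empty file has to end in a newline that is not itself preceded
// by a backslash. Call this on the unspliced text.
bool ends_properly(const std::string& src)
{
    if (src.empty()) return true;
    if (src.back() != '\n') return false;
    return src.size() < 2 || src[src.size() - 2] != '\\';
}
```

Note that splicing can glue a universal character name together: the text
\u00 followed by backslash-newline and 41 becomes \u0041. That is the case
the missing test would have to detect.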

> [1] Our "Wave-1" would convert the original text (iterators)
> into phase-1 tokens. Our "Wave-2" would convert phase-1
> token (iterators) into phase-2 tokens, etc. Remember that
> any file-name and line/column positions will have to be
> passed through each phase.

Wave currently follows exactly the design you describe, except that it
applies it not to the compilation phases outlined by the Standard but to
the layers outlined above. Both the lexing and the preprocessing layer
provide tokens through an iterator interface.

> [2] I thought Wave just did phase-3, with phases 1 and 2
> thrown in at the same time. But now I'm not sure which phase
> Wave stops at. I don't think it can go past phase-4, because
> doing phase-5 needs knowledge of the destination platform.

Yes. Wave stops at phase 4.

> [3] Only '?' characters that are part of a valid trigraph
> sequence are converted; all others are left unchanged.

Yes. This is done as expected.
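
For reference, the rule in [3] can be sketched as follows. This is an
illustration only, not Wave's code; trigraph_replacement and
replace_trigraphs are hypothetical names. Only the nine trigraph sequences
defined by the Standard are replaced, every other '?' is copied unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Map the third character of a trigraph to its replacement, or return 0
// if the character does not complete one of the nine trigraphs.
char trigraph_replacement(char c)
{
    switch (c) {
        case '=':  return '#';
        case '/':  return '\\';
        case '\'': return '^';
        case '(':  return '[';
        case ')':  return ']';
        case '!':  return '|';
        case '<':  return '{';
        case '>':  return '}';
        case '-':  return '~';
        default:   return 0;
    }
}

// Replace trigraphs left to right; question marks that are not part of a
// valid trigraph are left untouched.
std::string replace_trigraphs(const std::string& src)
{
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '?' && i + 2 < src.size() && src[i + 1] == '?') {
            char r = trigraph_replacement(src[i + 2]);
            if (r != 0) {
                out += r;
                i += 2;  // consume the whole three-character sequence
                continue;
            }
        }
        out += src[i];
    }
    assert(out.size() <= src.size());  // replacement only ever shrinks
    return out;
}
```

When trying this out, write the question marks in test strings as "?\?="
rather than as a literal trigraph, since a compiler in a strict ISO mode may
otherwise replace the trigraph before the function ever sees it.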

> [4] But actual "\uXXXX" resolution doesn't happen until phase 5!

Wave treats \uxxxx and \Uxxxxxxxx as single characters but doesn't care
about their semantics.
The only things it verifies are:
- that these have valid values (as described in Annex E of the Standard)
- that token concatenation does not produce invalid \uxxxx or \Uxxxxxxxx
character values.
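
A sketch of such a check, under stated assumptions: parse_ucn and
plausible_ucn_value are hypothetical names, and the range test is a
simplified stand-in (control characters, surrogates, upper bound), not the
Annex E table that Wave consults.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Parse a universal character name at the start of s: a backslash, then
// 'u' and four hex digits, or 'U' and eight hex digits. On success store
// the value and return the number of characters consumed (6 or 10);
// return 0 otherwise.
std::size_t parse_ucn(const std::string& s, unsigned long& value)
{
    if (s.size() < 2 || s[0] != '\\' || (s[1] != 'u' && s[1] != 'U'))
        return 0;
    const std::size_t digits = (s[1] == 'u') ? 4 : 8;
    if (s.size() < 2 + digits) return 0;
    unsigned long v = 0;
    for (std::size_t i = 2; i < 2 + digits; ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (!std::isxdigit(c)) return 0;
        v = v * 16 + (std::isdigit(c) ? c - '0' : std::tolower(c) - 'a' + 10);
    }
    assert(digits == 8 || v <= 0xFFFFul);  // four digits fit in 16 bits
    value = v;
    return 2 + digits;
}

// ASSUMPTION: simplified validity rule for illustration only. It rejects
// control characters, the surrogate range and values above 0x10FFFF.
bool plausible_ucn_value(unsigned long v)
{
    if (v < 0x20 || (v >= 0x7F && v <= 0x9F)) return false;
    if (v >= 0xD800 && v <= 0xDFFF) return false;
    return v <= 0x10FFFFul;
}
```

The same two functions could be run over the spelling of a token produced
by ## to cover the concatenation case.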

Generally speaking, I agree with you that now that Wave is part of Boost it
should conform to the Standard in this regard as well. It will result in a
major rewrite of some parts of Wave, though.

Regards Hartmut

