

From: Paul Mensonides (pmenso57_at_[hidden])
Date: 2003-12-03 20:18:26


> -----Original Message-----
> [mailto:boost-bounces_at_[hidden]] On Behalf Of Dan W.

> Hi again, Paul. I'm sorry to bother so much. I forwarded your reply to
> Walter, of Digital Mars, and he still doesn't seem convinced. I am 99%
> convinced, but I need to understand this fully if I'll ever be able to
> convince him :-)

Okay.

> Let me try to put what you said in my own words, and then I'll ask you a
> pointed question:
>
> I always thought, myself, that the preprocessor worked at a textual level.
> Now I'm getting the idea that the first pass in preprocessing is
> 'tokenization', which I take it to mean tagging tokens, like "abracadabra"
> or "." or ")". So, if we have,

Conversion from text to preprocessing tokens occurs during translation phase 3.
Macro expansion does not occur until phase 4. (Phases 1 and 2 are for handling
of trigraphs, universal character names, etc., and line-splicing caused by
backslash-newline.)

> #define a(x) x
> a(something).else
>
> outputting
>
> something.else
>
> is NOT 'concatenation', as no tokens are being merged.

Yes.

> Could we say, then, that ## was invented as a special power tool for
> 'violating initial tokenization' and forcing a merger of two tokens?

For the most part, yes. Tokenization is already complete at the point of macro
replacement. Token-pasting operates on two preprocessing tokens and produces a
single preprocessing token (if it doesn't, it is undefined behavior). So,
you're correct in a sense, but it doesn't exactly "violate" tokenization. It is
simply a means to merge two existing preprocessing tokens into one preprocessing
token--there is no other way to do this.

> If so, we could say that in the example above, the tokens are not being
> violated, and therefore ## is not needed.

Violated is not the right word, but I'll explain this in detail below.

> Now the pointed question: What then if I write,
>
> a(something)else
>
> without a "."? Should or shouldn't the preprocessor add a space in this
> particular case?

Strictly speaking, it shouldn't. *However*, if the preprocessor were doing
_preprocessing only_ (i.e. text stream -> text stream), it would have to insert
a space in order to prevent incorrect retokenization by another tool (i.e.
compiler). This is only a hack though. It is a workaround to account for the
extra retokenization that doesn't occur during the normal phases of translation.
This extra step must be introduced after macro expansion, after file inclusion,
etc. (i.e. after the preprocessor does all its normal stuff) in order to
preserve the semantics of the original program as if it followed the phases of
translation to the letter.

Consider the following sequence of preprocessing tokens:

a + b

There are only three preprocessing tokens, but there are five "entities" that
the preprocessor must preserve (whitespace has no significance between entities
in the following sequences; only the entities within the < > delimiters matter):

<a> < > <+> < > <b>

Once you are operating at the token level, rather than textual representation,
two preprocessing tokens can exist side-by-side without any form of merging.
Consider:

#define a(x) x

a(something)else

The result is (and must be):

<something> <else>

I.e. two adjacent preprocessing tokens with no intervening whitespace entities.
OTOH, in:

a(something) else

The result is (and must be):

<something> < > <else>

Similarly:

#define id(x) x

#define a(x, y) id(x)id(y)

a(+,+)

<+> <+> // not <++> and not <+> < > <+>

Moral of the story is, there is absolutely no problem with adjacent
preprocessing tokens whose spellings, if retokenized, would result in a single
preprocessing token or in different preprocessing tokens. E.g.

<+> <+> // ++ retokenized is <++>

<+> <==> // +== retokenized is <+=> <=>

The point being, that that retokenization isn't supposed to happen. Each
"preprocessing token" is converted directly to a "token". There is no
intervening retokenization step. Note that the difference between a
"preprocessing token" and a "token" is that a preprocessing token is
significantly looser, particularly regarding numeric literals. Also, some
things are preprocessing tokens but are not tokens, such as the backslash here:

a \ b

The preprocessor has no problem dealing with the above as preprocessing
tokens (with whitespace entities between them to be accurate), but the program
is ill-formed if a preprocessing token needs to be converted to a token but
can't be, as is the case with the backslash above. Another such example is:

0.0.0

...which is a single pp-number preprocessing token. However, if that
preprocessing token ever reaches a point where it gets converted to a token, the
program is ill-formed because a token cannot be formed from such a pp-number.
(The conversion from preprocessing tokens to tokens occurs in three places: the
expression of an #if directive, the expression of an #elif directive, and when
"preprocessing" is finished.)

Anyway, back to the original header name example. If you have the following:

#define a(x) b(x).h
#define b(x) x

The above doesn't *guarantee* that no whitespace exists before the period.
However, it does guarantee that no whitespace is introduced by the above macros
alone. If whitespace is to exist after macro expansion, it would have to come
in with the argument x. Such as:

#define empty()

a(file empty()) // file .h or <file> < > <.> <h>

In this case, the spellings of the preprocessing tokens are retokenized to form
a header-name preprocessing token, but that happens only on #include directives
where macro replacement is involved, and only when the original tokens of the
#include directive cannot be interpreted as a header-name (either <...> or
" ") to begin with. For example, the following two include directives are
interpreted in drastically different ways:

#define empty()

#include <file.h>
#include empty() <file.h>

The first #include directive (excluding the #include itself) is tokenized as a
single header-name preprocessing token of the form <h-char-sequence>. No macro
replacement occurs. The second #include directive, OTOH, is interpreted as:

<empty> <(> <)> < > <<> <file> <.> <h> <>>

Which then undergoes macro replacement yielding:

< > <<> <file> <.> <h> <>>

which is then reinterpreted as a header-name. (If this reinterpretation is
impossible, you get undefined behavior.)

The above example that introduces a whitespace in the argument could also be
illustrated without the use of empty():

a(file ) // file .h

Here the standard is a bit unclear. It does not say that leading and trailing
whitespace is removed from the sequence of preprocessing tokens that makes up an
argument to a function-like macro. However, it makes sense to do so because the
arguments to a macro are identified in a separate step before the macro is
actually invoked. That is why this works correctly:

#define comma ,
#define rparen )
#define id(x) x

id(,) // error: too many arguments
id(comma) // okay, results in ","
id(a rparen b) // okay, results in "a ) b" not "a b)"

These work this way because the arguments to a function-like macro are
identified *before* the macro is invoked and *before* the arguments themselves
undergo macro replacement. Because this identification occurs, it is reasonable
to assume that leading and trailing whitespace should be removed from the
argument by the preprocessor--but the standard does not say that. Hence, the
only way to guarantee that no leading or trailing whitespace exists on some
arbitrary sequence of preprocessing tokens 'x' is to use token-pasting to a
placemarker (which isn't yet well-defined in C++).

So, it is a bit unclear if whitespace can be picked up from the beginning or end
of an argument, but it is perfectly clear that whitespace can be introduced via
macro expansion when the argument undergoes macro replacement. Therefore, the
above header-name stuff is only a partial guarantee, and could still fail.
The only way that this situation could be completely handled in current C++
(i.e. w/o placemarkers) is to never put leading or trailing whitespace in the
arguments to a macro invocation...

MACRO(1,2) // not: MACRO(1, 2)

... and also to never introduce such whitespace in the replacement lists of
macros that are called. That solution is unreasonable because you wouldn't be
able to do backslash-newline with indentation in a complex replacement list:

#define A(x, y) \
    B( \
        C(x), \
        C(y) \
    ) \
    /**/

Instead, you'd have to either 1) not use backslash-newline regardless of the
complexity of the replacement list or 2) not use any indentation:

#define A(x, y) \
B(\
C(x),\
C(y)\
)\
/**/

Which is completely unreasonable.

Make sense?

Regards,
Paul Mensonides


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk