Boost logo

Boost :

Subject: Re: [boost] [preprocessor] Sequences vs. All Other Data Structures
From: Paul Mensonides (pmenso57_at_[hidden])
Date: 2012-04-23 01:04:32


On Sun, 22 Apr 2012 21:51:06 -0400, Edward Diener wrote:

> On 4/22/2012 8:43 PM, Stephan T. Lavavej wrote:

>> If someone could prepare *minimized* test cases demonstrating the VC
>> bugs that are holding you back here, I could file compiler bugs.
>
> One of the main issues is that Microsoft follows the C89 standard when
> it comes to the preprocessor. That has been superseded in both C and C++
> a long time ago. So what does one say to a company that follows a very
> old standard and defends the inadequacy of its preprocessor by saying
> that according to that standard they are processing preprocessor symbols
> correctly ? It is like a company saying that they implement a computer
> language of 23 years ago and despite the fact that the language has
> changed drastically since that time, they are proud to not keep up with
> those changes.

It isn't that. VC++ doesn't even implement the preprocessor correctly for
C89. Essentially, the macro expansion algorithm appears to be
fundamentally broken. One small example:

#define A() 123

#define B() ()

A B() // should expand to A() (and does)

#define C() A B()

C() // should *still* expand to A() (but instead expands to 123)

The entire methodology is borked here. Leaving blue-paint aside (and
argument processing aside), macro expansion should work like a stream
editor.

+ + C ( ) + +
^

+ + C ( ) + +
  ^

+ + C ( ) + +
    ^

+ + A B ( ) + +
    ^

THIS is "rescanning and further replacement." Rescanning and further
replacement is *not* a recursive process that occurs "in the replacement
list."

+ + A B ( ) + +
      ^

+ + A ( ) + +
      ^ // note: not at 'A', scanning doesn't go backward

+ + A ( ) + +
        ^

+ + A ( ) + +
          ^

+ + A ( ) + +
            ^

+ + A ( ) + +
              ^

In the above, everything to the left of the caret is output and everything
to the right is input.

For the top level source, that output is the underlying language parser
(or preprocess-only output). The only time it isn't is when an argument
to a macro--which is *actually used* at least once in the replacement list
without being an operand of # or ##--is processed. In that case, a
recursive scan of the sequence of tokens making up the argument is
necessary. In that case, the output of the scan is redirected to a buffer
(depending on exact implementation methodology) for future substitution
into a copy of the replacement list.

Things get more complicated when dealing with blue paint and disabling
contexts. A disabling context is a range associated with a macro name
(that exists only during a scan for macro expansion--i.e. if that scan is
redirected to buffer for substitution, the result in the buffer does not
contain annotations representing the disabling context). If an identifier
token which matches the macro name is found in the range during the scan,
it is "painted blue." A painted token cannot be used as the name of a
macro in a macro invocation. However, while the context that causes blue
paint is transient, blue paint on a token is not. It remains even when
the scan output is redirected for substitution. Note that substitution
ultimately means that the sequence of tokens in the argument are going to
get scanned again. When that happens, the contexts from the first scan
are gone, but any blue paint on particular tokens is not.

As a concrete example,

#define A(x) B(x) C(x) D(x)
#define B(x)
#define C(x) x
#define D(x) #x

A(B(1))

Where M' = painted M token, { ... } is the hideset H (which may be flags
in the symbol table), +M is a virtual token which means H := H | M, and -M
is a virtual token which means H := H \ M...

[begin]
A ( B ( 1 ) ) EOF
^ {}
    [begin]
    B ( 1 ) EOF
    ^ {}
        [begin]
        1 EOF
        ^ {}
        1 EOF
          ^ {}
        [end]
    +B 1 -B EOF
    ^ {}
    1 -B EOF
    ^ { B }
    1 -B EOF
      ^ { B }
    1 EOF
      ^ {}
    [end]
+A "B(1)" B ( 1 ) C ( 1 ) D ( 1 ) -A EOF
^ {}
"B(1)" B ( 1 ) C ( 1 ) D ( 1 ) -A EOF
^ { A }
"B(1)" B ( 1 ) C ( 1 ) D ( 1 ) -A EOF
       ^ { A }
    [begin]
    1 EOF
    ^ { A }
    1 EOF
      ^ { A }
    [end]
"B(1)" +B 1 -B C ( 1 ) D ( 1 ) -A EOF
       ^ { A }
"B(1)" 1 -B C ( 1 ) D ( 1 ) -A EOF
       ^ { A, B }
"B(1)" 1 -B C ( 1 ) D ( 1 ) -A EOF
         ^ { A, B }
"B(1)" 1 C ( 1 ) D ( 1 ) -A EOF
         ^ { A }
"B(1)" 1 +C -C D ( 1 ) -A EOF
         ^ { A }
"B(1)" 1 -C D ( 1 ) -A EOF
         ^ { A, C }
"B(1)" 1 D ( 1 ) -A EOF
         ^ { A }
    [begin]
    1 EOF
    ^ { A }
    1 EOF
      ^ { A }
    [end]
"B(1)" 1 +D "1" A ( 1 ) B ( 1 ) -D -A EOF
         ^ { A }
"B(1)" 1 "1" A ( 1 ) B ( 1 ) -D -A EOF
         ^ { A, D }
"B(1)" 1 "1" A ( 1 ) B ( 1 ) -D -A EOF
             ^ { A, D }
"B(1)" 1 "1" A' ( 1 ) B ( 1 ) -D -A EOF
                ^ { A, D }
"B(1)" 1 "1" A' ( 1 ) B ( 1 ) -D -A EOF
                  ^ { A, D }
"B(1)" 1 "1" A' ( 1 ) B ( 1 ) -D -A EOF
                    ^ { A, D }
"B(1)" 1 "1" A' ( 1 ) B ( 1 ) -D -A EOF
                      ^ { A, D }
    [begin]
    1 EOF
    ^ { A, D }
    1 EOF
      ^ { A, D }
    [end]
"B(1)" 1 "1" A' ( 1 ) +B 1 -B -D -A EOF
                      ^ { A, D }
"B(1)" 1 "1" A' ( 1 ) 1 -B -D -A EOF
                      ^ { A, B, D }
"B(1)" 1 "1" A' ( 1 ) 1 -D -A EOF
                        ^ { A, D }
"B(1)" 1 "1" A' ( 1 ) 1 -A EOF
                        ^ { A }
"B(1)" 1 "1" A' ( 1 ) 1 EOF
                        ^ {}
[end]

VC++ gets the result correct here (though it doesn't get it by the correct
method).

The algorithm is bizarre in relation to typical code interpreters, but it
is actually shockingly simple. The only times that it gets tricky are
when a -M occurs within the sequence of tokens that makes up an invocation
which I glossed over above (because they don't occur in the above). E.g.
scenarios like:

MACRO -M ( ARG )

or

MACRO ( ARG, -M ARG )

In both of these cases, the -M must get executed prior to MACRO(...) being
replaced by MACRO's replacement list (and resuming scanning).

The first of these is not very tricky--and both boost-pp and chaos-pp
require this. The algorithm merely has to execute the virtual token when
finding the left parentheses. The second is more tricky (and I don't
think even chaos-pp requires this one to work correctly) because the
algorithm has to *correctly* deal with possibly four different scenarios:

1. The argument is not used (any -M must still be executed therein).
2. The argument is used as an operand of # (any -M must still be executed
therein).
3. The argument is used as an operand of ## (any -M must still be executed
and tokens must be painted as required).
4. The argument is used as a non-operand of # or ## (recursive scan)

Worse, 2, 3, and 4 can all be used at the same time:

#define A(x) #x x ## x x

However, in 2 and 3, you cannot encounter a +M, so you can just push a +M
into a "reset" stack for every -M you find and then execute the reset
stack prior to processing an argument (provided the recursive scan case is
done last).

VC++'s macro expansion algorithm appears to be a combination of
optimizations and hacks that yield correct results for simple cases, but
often yields incorrect results where the input requires the actual rules
to be followed. Obvious examples are the way __VA_ARGS__ is "optimized"
into a single argument and the extra scan applied in the following causing
it to yield 123 instead of A()--which is presumably a hack of some kind.

#define A() 123
#define B() ()
#define C() A B()

C()

However, these are only the tip of the iceberg. I have a suspicion that
the algorithm used is actually totally unlike what it is supposed to be.
I.e. I'd guess it's a re-write-the-algorithm rather than a fix-a-few-bugs
scenario.

> A second issue is that quite often simple test cases are unavailable
> since the sort of failures exhibited by Microsoft's preprocessor occur
> when one is attempting to use the facilities of the Boost PP library and
> that is often at the cutting edge of what one can do with the
> preprocessor, aside from Chaos which assumes a 100% compliant
> preprocessor.

Edward is correct here. The problems often occur in a combinatorial
fashion. This is almost certainly due to the actual algorithm operating
in way very foreign to how it should, so when simple examples are used,
the output appears correct except that you cannot see the algorithm state
(i.e. hidesets, blue paint, etc.) which is often *not* correct. So, when
things are used in combination, what appears to be correct in isolation is
actually carrying forward with algorithm state that affects further
processing incorrectly. These types of scenarios are inordinately
difficult to find workarounds for, and any workarounds are inherently
unstable (because you don't know exactly what they are "fixing"). All it
really takes is some different input or different combination and things
might break. Chaos has over 500 primary interface macros and over two
thousand primary and secondary interface macros (i.e. SEQ_FOR_EACH is
primary, SEQ_FOR_EACH_S is secondary (derivative))--not to mention the
number of implementation macros (which is probably in the 10000 range,
though I haven't counted). Full circle, even if it was a matter of
testing permutations and combinations of primary interfaces used only
once, eΓ(500 + 1, 1) is still about 3.3 x 10^1134 possible combinations.
Aka, not tenable.

The way boost-pp deals with this is essentially by having a complete
separate implementation for MSVC which tries to do everything it can to
force things to happen when they should or before it is too late--which
frequently doesn't work.

<rant>
Since I've been doing this pp-stuff, which is well over a decade, I've
seen numerous compiler and tool vendors fix or significantly improve their
preprocessors--several of which have informally consulted with me.
However, I have seen absolutely zero effort by MS. The stock response for
a bug report is "closed as won't fix" for anything related to the
preprocessor. I've heard--sometimes on this list--numerous people who
represent MS say that "they'll see what they can do," but nothing ever
happens. Fix your preprocessor, MS, which will probably gain you nothing
financially, and stop implementing extensions to the language in lieu of
implementing the language itself, and I'll actually maybe believe that you
care about C++, standards, and ethics.
</rant>

Regards,
Paul Mensonides


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk