From: David Abrahams (dave_at_[hidden])
Date: 2004-12-26 11:20:34


Joel de Guzman wrote:
> David Abrahams wrote:
>
>> Well, I say "you could use first/follow to prune unproductive parses
>> while still using dynamic failure conditions to get nondeterminism," and
>> you come back with "Classic LL(1) doesn't have those." I assume you're
>> using that as an argument against my statement, but it's a hollow
>> argument because classic LL(1) is only one way to use first/follow with
>> recursive descent.
>
> Oh. I see.
>
> No, I'm not using it as an argument, and no, I am not disputing
> your statement. That's the very reason I qualified my statement
> with "classic LL(1)". Actually, we were in agreement (or at least I
> thought so).

I guess I was confused (and still am, if you say you've been agreeing
with me) by this little exchange:

Me:
  Okay, either I'm "right" and first-follow can be used to prune
  unproductive parses in a nondeterministic parse *or* using
  first-follow makes a parser deterministic. It can't be both :)

You:
  Using first-follow makes a parser deterministic, it can't be both.

But I'm happy to drop this now. It isn't crucial.

>>>>Hum? The algorithm for computing the first and follow sets of a symbol
>>>>doesn't have an "otherwise the computation fails" branch, and the
>>>>definitions of these sets are simple:
>>>>
>>>> first(a) = the set of tokens that can begin strings derived from a.
>>>> follow(N) = the set of tokens that can follow N in grammatical sequences
>>>>
>>
>>
>> So "first follow computation will fail" is false. BTW, I don't know
>> what NLL(1) means. Nondeterministic LL(1)? I can't even picture that.
>
> No, it is true.
>
> BTW, the "NLL" is a typo introduced by your quoting of my post.

<blush>Sorry!</blush>.

>
> When the grammar is ambiguous, the computation of the first set
> will fail because the set itself cannot have multiple same entries.

It's hard for me to understand how you reach that conclusion. When an
algorithm says to add something to a set that is already in the set, the
normal expectation is that the set remains unchanged and the algorithm
proceeds, not that it "fails."
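
To illustrate, here's the set behavior I'm describing, with
std::set<char> standing in for a token set (just a sketch, not Spirit
code):

  #include <cassert>
  #include <set>

  int main()
  {
      std::set<char> first;
      first.insert('a');           // first("apple") contributes 'a'
      first.insert('a');           // a duplicate 'a' is a no-op, not a failure
      assert(first.size() == 1);   // the set is unchanged; the algorithm proceeds
  }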

> You'll have to look ahead more than one token to resolve the ambiguity,

That's an entirely different question than whether the computation of
the first and follow sets fails.

> and by doing so, the grammar is no longer LL(1).

I've tried to say several times that the properties of LL(1) are
irrelevant to the arguments I've been making, and you always seem to
eventually agree, but here it is again. What is your point in bringing
up LL(1) here?

>>>>Maybe we're talking about the same thing.
>>>
>>>Yes! That was what I was trying to point out from the start.
>>>However, it's not as simple as it seems.
>>
>>
>> Why not?
>
> Long story. Anyway, the basic problem is that Spirit parsers
> are dynamic. The computation of the first set, for instance,
> cannot be done at compile time.

Sure, but I thought you were going to be mostly emphasizing a static
subrule model. I guess if you use string literals you'll have this
problem no matter what, though I'm not sure the use of string literals
should be a foregone conclusion. If you consider parser generators like
YACC, they cleanly separate lexical analysis from parsing, and the parser
itself never works with string literals.

> Say for example:
>
> r = str_p("apple") | "banana" | "orange";
>
> the first set of r is: ['a', 'b', 'o']. This can be obtained
> through a call to first(r) which calls first("apple"),
> first("banana") and first("orange") recursively.

Right.
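
Here's roughly how I picture that recursion, as a self-contained sketch
(lit, alternative, and first_of are stand-ins I made up for
illustration, not Spirit's actual types):

  #include <iostream>
  #include <set>
  #include <string>

  typedef std::set<char> token_set;

  // stand-in for a string literal parser
  struct lit
  {
      std::string s;
      explicit lit(char const* p) : s(p) {}
  };

  // stand-in for the alternative composite a | b
  template <class L, class R>
  struct alternative { L left; R right; };

  template <class L, class R>
  alternative<L, R> operator|(L const& l, R const& r)
  {
      alternative<L, R> a = { l, r };
      return a;
  }

  // the first set of a literal is its leading character
  token_set first_of(lit const& p)
  {
      token_set s;
      if (!p.s.empty()) s.insert(p.s[0]);
      return s;
  }

  // the first set of an alternative is the union of its
  // branches' first sets, computed recursively
  template <class L, class R>
  token_set first_of(alternative<L, R> const& a)
  {
      token_set s = first_of(a.left);
      token_set r = first_of(a.right);
      s.insert(r.begin(), r.end());
      return s;
  }

  int main()
  {
      token_set f = first_of(lit("apple") | lit("banana") | lit("orange"));
      for (token_set::const_iterator i = f.begin(); i != f.end(); ++i)
          std::cout << *i << ' ';   // prints: a b o
  }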

> The problem is that each alternative can potentially have
> different types. A production such as:
>
> a | b | c;

(nit: that's an RHS, not a production. You need an LHS symbol to make it
a production).

> can be thought of as a
>
> composite<alternative_operation, tuple<A, B, C> >
>
> where A, B and C are the types of a, b and c. So, now, when
> creating the parser, you'll have to make a table of alternates
> tuple<A, B, C> which dispatches properly based on information
> known only at runtime: 1) the current input character and
> 2) the first/follow sets of A, B and C.
>
> One possible solution is to make the parsers truly static.
> Example:
>
> str<'a','p','p','l','e'>()
> str<'b','a','n','a','n','a'>()
> | str<'o','r','a','n','g','e'>()
>
> in this case, first/follow can be a static operation. But, I'm not
> thrilled about this solution. It has lots of problems too (e.g.
> the rule's opaque virtual function wall).

Again, only an issue when crossing the static/dynamic boundary.

In YACC this would be:

  enum tokens { apple, banana, orange };

Actually, I rather like

  typedef str<'a','p','p','l','e'> apple;

for Spirit. Most grammars don't have billions of tokens. BTW, is

  typedef str<'appl','e'> apple;

legal C++? I'm just curious about multi-char literals.
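
And with a static token type along those lines, first() really can be a
compile-time operation. A rough sketch (the fixed-arity str here is my
guess at a shape, not Spirit's actual definition):

  // hypothetical fixed-arity str<>; the first character of a static
  // token is a compile-time constant
  template <char C0, char C1 = '\0', char C2 = '\0',
            char C3 = '\0', char C4 = '\0', char C5 = '\0'>
  struct str
  {
      static char const first = C0;   // no runtime first() computation
  };

  typedef str<'a','p','p','l','e'> apple;

  int main()
  {
      return apple::first == 'a' ? 0 : 1;   // 'a', known at compile time
  }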

> At the opposite extreme, another possible solution is to use virtual
> functions and run-time dispatch all over. Hence, the productions can
> be stored in standard homogeneous containers instead of heterogeneous
> containers (tuples). I'm also not thrilled about this solution.

Agreed. I don't really see why your previous idea was so bad in the
case where you're *not* using runtime dispatch all over.
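
Just so we're picturing that extreme the same way, I take "virtual
functions and run-time dispatch all over" to mean roughly this (an
illustrative sketch only):

  #include <cassert>
  #include <cstddef>
  #include <set>
  #include <string>
  #include <vector>

  typedef std::set<char> token_set;

  // one abstract interface; every operation pays for a virtual call
  struct parser
  {
      virtual ~parser() {}
      virtual token_set first() const = 0;
  };

  // a literal parser hidden behind that interface
  struct lit : parser
  {
      std::string s;
      explicit lit(char const* p) : s(p) {}
      token_set first() const
      {
          token_set t;
          if (!s.empty()) t.insert(s[0]);
          return t;
      }
  };

  int main()
  {
      // the payoff: alternates fit in an ordinary homogeneous container
      std::vector<parser*> alternates;
      alternates.push_back(new lit("apple"));
      alternates.push_back(new lit("banana"));
      alternates.push_back(new lit("orange"));

      // ...at the cost of dynamic dispatch on every first() call
      token_set f;
      for (std::size_t i = 0; i != alternates.size(); ++i)
      {
          token_set t = alternates[i]->first();
          f.insert(t.begin(), t.end());
      }
      assert(f.count('a') && f.count('b') && f.count('o'));

      for (std::size_t i = 0; i != alternates.size(); ++i)
          delete alternates[i];
  }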

> Searching for a viable solution to this problem is actually
> one of the main motivations behind the development of Fusion. So,
> IOW, I had to develop some basic infrastructure/scaffolding first
> before I could tackle predictive parsing in Spirit. Hence, it wasn't
> as easy as it seems.

<vague understanding> OK.

>> I once sent you a comparative study of some NLP parsing strategies. One
>> of those was using first/follow computation and parsing ambiguous
>> grammars. I'm pretty sure they were using backtracking or that they
>> told me that they had thought of adding it.
>
> Cool! I'll look into that.

I doubt you will find any specifics in the material I sent you, though.

>>>Not really. There's a big problem with backtracking in general:
>>>backtracking and semantic actions do not mix well (e.g. false
>>>triggering).
>>
>>
>> It's not a problem if you build a parse tree and run the analysis on
>> that (which is sometimes required for synthesized attributes anyway).

Sorry, I meant inherited attributes.

> Agreed. Yet, for that to succeed, you'll have to build it all the
> way to the top (start rule). Only then will all correct branches
> be known and all failed branches be thrown away. For simple parsing
> tasks, this is expensive. In a lot of cases, I do not want to build
> a parse tree.
>
>>>In a syntactic and semantic predicate, all actions
>>>are intentionally inhibited.
>>
>>
>> Have you considered the strategy of
>> http://docs.biostat.wustl.edu/cgi-bin/info2html?(bison.info.gz)GLR%2520Parsers
>> (http://tinyurl.com/694kq)
>> ??
>
> No. At least, not just yet. I'd like to concentrate first on top-down
> RD approaches. I have a keen eye on recursive ascent, however.

Hm. I didn't *think* their strategy for managing nondeterministic
semantic actions was specific to bottom-up parsing. Is it??

Let me also mention that taking throw-away semantic actions is not so
bad when:

  1. The actions are all stateless and just produce synthesized
     attributes (see the sketch after this list)

  2. First/Follow pruning eliminates most parse explorations that
     later turn out to fail anyway.
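
By "stateless" in 1 I mean an action shaped like this (illustrative
only, not Spirit's actual action interface):

  #include <cassert>
  #include <string>

  // a stateless action: it computes a synthesized attribute from the
  // matched text and touches nothing else, so running it on a parse
  // branch that later fails and is thrown away does no harm
  int int_value(std::string const& digits)
  {
      int v = 0;
      for (std::string::size_type i = 0; i != digits.size(); ++i)
          v = v * 10 + (digits[i] - '0');
      return v;
  }

  int main()
  {
      // even if the enclosing parse branch is abandoned, no external
      // state has changed
      assert(int_value("123") == 123);
  }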

-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com
