From: David A. Greene (greened_at_[hidden])
Date: 2002-01-16 18:08:31


rogeeff wrote:

> First of all I would like to say that IMO it is odd even to discuss
> the ability to use Spirit for a generic command line parser. It's like
> using a cannon to kill a fly. For one, it is very expensive and heavy, and
> also I would have to drag it all over the place.

How do you know this? I cannot make a judgement because I haven't
used Spirit (yet). But I know from reading the Spirit mailing list
that Joel et al. have put a lot of thought into keeping things
lightweight.

>>Nope. Spirit is targeted at any C++ programmer who needs to do
>>parsing, which is most of them. Even more so now in the age of the
>>internet and all its protocols which need to be parsed.
>>
> So now in any place where we were using tokenizer or regular
> expressions I should pile up Spirit.

No, not always. Sometimes you just want a tokenizer to work with CSV
data or some similar thing like /etc/passwd data. There is really
no grammar there (or rather, it is a very simple one).

For almost anything non-trivial (read, not a delimited set of values)
one usually ends up wanting a formalized grammar specification
and a parser to go along with it, just to make things simpler in the
long run. This is my personal experience, so take it with a
grain of salt.
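
For those simple cases something like this is all it takes (a sketch
against the current Boost.Tokenizer interface; the record and the
separator set are made up for illustration):

#include <boost/tokenizer.hpp>
#include <iostream>
#include <string>

int main()
{
    // An /etc/passwd-style record; fields are separated by colons.
    std::string line = "greened:x:1000:1000:David A. Greene:/home/greened:/bin/sh";

    typedef boost::tokenizer<boost::char_separator<char> > tokenizer_t;
    // Drop the delimiter but keep empty fields, so blank entries survive.
    boost::char_separator<char> sep(":", "", boost::keep_empty_tokens);
    tokenizer_t fields(line, sep);

    for (tokenizer_t::iterator it = fields.begin(); it != fields.end(); ++it)
        std::cout << *it << '\n';
}

No grammar in sight, and none needed.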

>>If a programmer doesn't know EBNF, then they are missing an
>>important piece of knowledge, since it's the standard for computer
>>language definition.
>
> Are you so sure? How is it in reality?

It's a standard, all right. It's also well-known. Spirit deviates
from it in various areas. Tools like YACC have done the same thing
in the past. But Spirit is "close enough" for people familiar with
EBNF or YACC to make do.

Parsing is a common and important task in computer science.
Rudimentary familiarity with the available tools is a must. It's
not terribly difficult to grasp.

>>Spirit is aimed to be a general parsing framework. It is not aimed to
>>be a simple command line parser.
>>
>
> That's the point.

You missed the point. Spirit is flexible enough for many, many
parsing tasks, including implementation of the command-line parser.
One need not expose the Spirit interface to the programmer. But it
makes a great deal of sense to me to use Spirit to do the actual
parsing.

>>structured in such a way, that you only pay for what you use.
>
> How many lines of includes will it add to use Spirit to parse a more
> or less complex command line?

Depends what "complex" means, I suppose. I think people get way too
caught up arguing about overhead for command-line parsing. It's a
one-shot job so performance shouldn't be an issue. Size is certainly
a concern. I don't have a feel for how Spirit scales in that sense.

It would be nice to have some numbers. Joel, do the Spirit guys
working on command-line parsers have any size numbers they can share?

>>sense (to me anyway) to use spirit internally for the command line
>>parser implementation. But, that still restricts the user to
>>whatever option format(s) the command line parser supports.
>
> Why would I want to use Spirit in the implementation, rather than
> regexp or tokenizer? Is it so flexible that I can implement
> arbitrary parsing with it?

I don't know what you mean by "arbitrary parsing." Spirit is at
least as flexible as YACC (well, except for left-recursion,
probably :)).

regexp and tokenizer don't necessarily have enough power to do
the job. Consider the option format we use in our software:

--option1={--nestedOption1=value1
            --nestedOption2={--nestedNestedOption1
                             --nestedNestedOption2=value2}
            --nestedOption3}

Tokenizer ain't gonna help much with that. Regex won't either,
because we need to be able to match a _tree_, not a linear sequence
of characters. We also need to be able to perform various
actions during the matching or produce some structure (i.e. a
parse tree or AST) that allows us to post-process the matched
input.
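
To give a feel for what I mean, a recursive Spirit rule can describe
that format in a handful of lines. This is only a sketch, assuming the
classic Spirit headers of a current Boost, with no error handling or
semantic actions; the rule names are mine, not anything official:

#include <boost/spirit/include/classic_core.hpp>
#include <iostream>
#include <string>

int main()
{
    using namespace boost::spirit::classic;

    // phrase_scanner_t matches the char const* / space_p skipper
    // combination used in the parse() call below.
    rule<phrase_scanner_t> option, value, option_list;

    // "--name", optionally followed by "=value".
    option      = lexeme_d[ str_p("--") >> +(alnum_p | ch_p('_')) ]
                  >> !( '=' >> value );

    // A value is either a braced, recursive list of options or a bare token.
    value       = ( '{' >> option_list >> '}' )
                | lexeme_d[ +(anychar_p - space_p - '}') ];

    option_list = +option;

    std::string input =
        "--option1={--nestedOption1=value1"
        " --nestedOption2={--nestedNestedOption1"
        " --nestedNestedOption2=value2}"
        " --nestedOption3}";

    // space_p as the skip parser lets whitespace separate nested options.
    bool ok = parse(input.c_str(), option_list, space_p).full;
    std::cout << (ok ? "matched" : "no match") << '\n';
}

The recursion between value and option_list is exactly the tree
structure that regex and tokenizer cannot express.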

Sure, it's _possible_ to parse with tokenizer and regex, in the
same sense that it's _possible_ to write "OO" code in assembler.
With tokenizer and regexp you'll end up writing a highly specialized
semantic action framework -- a framework that is already available
in Spirit.

For simple command-line specifications tokenizer may be sufficient.
It isn't for us. If the cost isn't too high (and that remains to
be seen), I see no reason not to use Spirit.

>>Learning how to use spirit is no different than learning a new API
>>or library.

> I would assume that a command-line parser will still have a MUCH
> simpler interface.

And by trading off flexibility for simplicity that parser can
still have the same interface but be implemented with Spirit.
If we want to get really fancy we can augment the command-line
interface to allow more complex specifications, allowing the
library parser to be extended. I think this is crucial for
any command-line processor that is included in a library,
especially one like Boost. Honestly, how many of us actually
use getopt() regularly for anything but the most trivial
utilities?
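
To be concrete, the kind of facade I have in mind is no more than
something like this (every name here is hypothetical, and the naive
parse() body is only a placeholder; the point is that it could just as
well be implemented with Spirit, regexp, or tokenizer without the
caller ever seeing which):

#include <map>
#include <string>

class cmdline
{
public:
    // Accept "--name" and "--name=value" arguments. The parsing detail
    // below is a naive placeholder; swap in Spirit, regexp, or tokenizer
    // without changing this interface.
    void parse(int argc, char* argv[])
    {
        for (int i = 1; i < argc; ++i)
        {
            std::string arg(argv[i]);
            std::string::size_type eq = arg.find('=');
            if (eq == std::string::npos)
                values_[arg] = "";
            else
                values_[arg.substr(0, eq)] = arg.substr(eq + 1);
        }
    }

    std::string const& operator[](std::string const& name) { return values_[name]; }

private:
    std::map<std::string, std::string> values_;
};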

>> Spirit makes parsing incredibly easy. Say, for instance you
>>had to write a function that would parse a complex number of the
>>form:
>>real, (real), or (real,imaginary) and store the real and imaginary
>>parts in 2 doubles:

>>With spirit, it's a one liner:
>>
>>return (
>> real_p[ref(real)]
>> | '('
>> >> real_p[ref(real)]
>> >> !(',' >> real_p[ref(imaginary)])
>> >> ')'
>> ).parse(str, str+strlen(str));

> I would say that it will take at least 10 min for a maintenance
> programmer to grasp what is written here (and this is not counting
> understanding how it's working).

I agree Spirit looks a little cryptic. In particular the assignment
of values is rather "magical" ("ref" should probably be named
"assign_to"). But even so, as someone who has experience with YACC
but zero with Spirit, I can follow this and understand what it means
(except for the bang, which I had to look up, but it makes sense
if you consider it an "|" with an empty left operand).

Now that I've taken 5 minutes of my time to look up two things
I wasn't completely familiar with, I can understand almost any
Spirit grammar.
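
For reference, here is the quoted fragment expanded into a complete
function. It is a sketch against classic Spirit in a current Boost,
using the free parse() function and the assign_a actor instead of ref,
so the spelling differs slightly from the quoted code:

#include <boost/spirit/include/classic_core.hpp>
#include <boost/spirit/include/classic_actor.hpp>

// Parse "real", "(real)", or "(real,imaginary)" into two doubles.
bool parse_complex(char const* str, double& real, double& imaginary)
{
    using namespace boost::spirit::classic;
    return parse(str,
              real_p[assign_a(real)]
            | '(' >> real_p[assign_a(real)]
                  >> !(',' >> real_p[assign_a(imaginary)])
                  >> ')',
            space_p).full;
}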

> I would implement the same logic in 2-3 lines using tokenizer or
> regexp. Something like this:
>
> token_iterator it( str, " \t," );
>
> real = lexical_cast<double>( *it++ );
> imaginary = lexical_cast<double>( *it );

Well, complex numbers are not a really good example to show why a
parser is useful because they are just a CSV structure with some
optional syntactic sugar on the ends.

Even so, this tokenizer example is at least as hard to understand
as the Spirit example. How are parens and the comma ignored? All
I see in a quick glance through the tokenizer documentation is that
by default "punctuation" is skipped. What "punctuation" means is
not immediately obvious, nor is a definition of the default
punctuation set easily found.

I don't see a token_iterator(char *, char *) constructor. I don't
even see a "token_iterator" declared anywhere. Are you sure your
example works? Am I missing something?
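
For the record, the closest working equivalent I can construct from the
documented interface looks like this (a sketch against the current
Boost.Tokenizer; the separator set " \t,()" is my guess at what the
example intends):

#include <boost/tokenizer.hpp>
#include <boost/lexical_cast.hpp>
#include <iostream>
#include <string>

int main()
{
    std::string str = "(1.5, 2.5)";

    typedef boost::tokenizer<boost::char_separator<char> > tokenizer_t;
    boost::char_separator<char> sep(" \t,()");   // drop whitespace, comma, parens
    tokenizer_t tokens(str, sep);

    tokenizer_t::iterator it = tokens.begin();
    double real      = boost::lexical_cast<double>(*it++);
    double imaginary = (it != tokens.end())
                       ? boost::lexical_cast<double>(*it)
                       : 0.0;

    std::cout << real << " + " << imaginary << "i\n";
}

That is hardly 2-3 lines once the separators are spelled out, and it
still performs no validation of the surrounding syntax.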

Tokenizer and regexp are essentially scanners. Doing something
moderately interesting usually requires something more to make
things easier.

                            -Dave

-- 
"Some little people have music in them, but Fats, he was all music,
  and you know how big he was."  --  James P. Johnson
