From: jbandela_at_[hidden]
Date: 2000-09-04 12:13:08
I agree with Aleksey. The problem of backtracking is a real one.
Consider this code to process an assignment:
if(iter->type == VAR){
    // Check whether the next token is an '=' without committing:
    // advance a copy of the iterator, not the iterator itself
    iterator_type temp(iter);
    ++temp;
    if(temp != end && temp->type == OP_EQ){
        // Process the assignment
        ...
        iter = temp;
    }
    // Otherwise iter is left unchanged (the backtrack)
}
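
For concreteness, here is a minimal compilable sketch of that lookahead
pattern; the Token struct, the VAR/OP_EQ kinds, and match_assignment are
all hypothetical stand-ins, not part of the proposed tokenizer:

#include <string>
#include <vector>

enum TokenType { VAR, OP_EQ, LITERAL };

struct Token {
    TokenType type;
    std::string text;
};

// Advances iter past "name =" and returns true when an assignment
// starts at iter; on any mismatch it leaves iter untouched.
bool match_assignment(std::vector<Token>::const_iterator& iter,
                      std::vector<Token>::const_iterator end)
{
    if (iter == end || iter->type != VAR)
        return false;
    std::vector<Token>::const_iterator temp(iter);
    ++temp;                 // tentative advance on a copy
    if (temp != end && temp->type == OP_EQ) {
        iter = temp;        // commit the lookahead
        ++iter;             // position on the value token
        return true;
    }
    return false;           // backtrack: iter never moved
}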
Another problem is that you may not want to process the entire input,
but rather get, say, the first two tokens and grab the rest as a
string. For example, suppose you have INI-style file parsing code that
reads name/value pairs like this:
name=value
We could have escape characters so that the name can contain '=' signs:

this\=test=Hello World=first program in C

To parse this quickly, we get the first two tokens ("this=test" and
"=") and just grab the rest of the line ("Hello World=first program in
C"), since we want to be able to support strings in the value that
have an arbitrary format. An example might be base64 encoding, which
uses '='.
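
As a rough standalone illustration of that "first two tokens, then the
rest of the line" idea (split_pair and its escape handling are my own
sketch, not the proposed tokenizer's interface):

#include <string>
#include <utility>

// Split "name=value" at the first unescaped '='. Backslash escapes
// are honored only in the name; the value is taken verbatim, so it
// may freely contain '=' (e.g. base64 padding) in any format.
std::pair<std::string, std::string> split_pair(const std::string& line)
{
    std::string name;
    std::string::size_type i = 0;
    while (i < line.size() && line[i] != '=') {
        if (line[i] == '\\' && i + 1 < line.size())
            ++i;            // drop the backslash, keep the next char
        name += line[i++];
    }
    std::string value =
        (i < line.size()) ? line.substr(i + 1) : std::string();
    return std::make_pair(name, value);
}

Applied to the line above, this yields the name "this=test" and the
value "Hello World=first program in C".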
Thanks again for looking at the code and commenting.
--- In boost_at_[hidden], "Aleksey Gurtovoy" <alexy_at_m...> wrote:
> Daryle Walker (<darylew_at_m...>) wrote:
>
> > I looked at the recent TokenIterator stuff, and I wonder if there
> > is a way to simplify tokenization. Not every part of a concept has
> > to be a class; we could replace the token iterator class with an
> > algorithm. How about:
>
> [snip]
> > template <typename Tokenizer, typename In, typename Out>
> > Tokenizer
> > tokenize (
> >     In src_begin,
> >     In src_end,
> >     Out dst_begin,
> >     Tokenizer tok )
> > {
> >     // Send any prefix tokens
> >     while ( tok )
> >         *dst_begin++ = *tok;
> >
> >     while ( src_begin != src_end )
> >     {
> >         // Give input symbols to the tokenizer
> >         tok( *src_begin++ );
> >
> >         // If a token can now be formed, send it
> >         while ( tok )
> >             *dst_begin++ = *tok;
> >     }
> >
> >     // Return the tokenizer in case more symbols are needed
> >     return tok;
> > }
> >
>
> That's always good to look at something from a different
> perspective :). However, I don't think that replacing the token
> iterator concept with some 'tokenize' algorithm would be beneficial.
> Actually, I don't think there is a common generic representation of
> all sequence parsing algorithms which we could factor out and turn
> into some (useful) 'tokenize' function template. For instance, the
> algorithm you posted pretty much rules out backtracking tokenizers.
> The fact is that iteration through an original input sequence that
> needs to be tokenized is too tightly tied to the parsing algorithm,
> and I don't think there is much sense in breaking these dependencies.
> So actually we don't care how the input sequence is iterated during
> the parsing process - that's the tokenizer's job. What we want is
> some standard way to get the results of that job, in a form that
> doesn't impose unnecessary requirements on users' code and that
> integrates well with the standard library itself. IMO, the iterator
> concept is exactly what we need. It doesn't force you to process the
> whole input sequence all at once and put the results somewhere,
> although you could do that easily if you wanted to. It directly
> supports the common pattern of many (high-level) parsing algorithms -
> read new lexeme -> go to some state -> do some work -> read new
> lexeme. It doesn't make constraining assumptions about how tokenizers
> work, and it allows tokenizers to have a minimal interface (e.g. a
> tokenizer may be implemented as just a function). As for the
> complexity of the current implementation (which concerns me too) -
> I hope it will be simplified a lot after we nail down the concepts.
>
> --Aleksey
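
To make the "read new lexeme -> do some work" loop concrete, here is a
small self-contained sketch of pull-style consumption through
iterators; consume and the pre-split lexeme array are hypothetical
illustrations, not the TokenIterator interface under review:

#include <iostream>
#include <string>

// The consumer drives tokenization one lexeme at a time and may stop
// at any point, instead of having every token pushed into an output
// sequence. TokenIt is any input iterator over lexemes.
template <typename TokenIt>
void consume(TokenIt first, TokenIt last)
{
    while (first != last) {
        std::cout << *first << '\n';  // do some work with the lexeme
        ++first;                      // read the next lexeme
    }
}

int main()
{
    // A fixed array of pre-split lexemes stands in for a tokenizer.
    const std::string lexemes[] = { "name", "=", "value" };
    consume(lexemes, lexemes + 3);
    return 0;
}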