
From: Aleksey Gurtovoy (alexy_at_[hidden])
Date: 2000-09-03 04:30:03


Daryle Walker (<darylew_at_[hidden]>) wrote:

> I looked at the recent TokenIterator stuff, and I wonder if there is a way
> to simplify tokenization. Not every part of a concept has to be a class;
> we could replace the token iterator class with an algorithm. How about:
>
[snip]
> template <typename Tokenizer, typename In, typename Out>
> Tokenizer
> tokenize (
>     In src_begin,
>     In src_end,
>     Out dst_begin,
>     Tokenizer tok )
> {
>     // Send any prefix tokens
>     while ( tok )
>         *dst_begin++ = *tok;
>
>     while ( src_begin != src_end )
>     {
>         // Give input symbols to tokenizer
>         tok( *src_begin++ );
>
>         // If a token can now be formed, send it
>         while ( tok )
>             *dst_begin++ = *tok;
>     }
>
>     // Return the tokenizer in case more symbols are needed
>     return tok;
> }
>
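
For concreteness, the proposal above might be driven along these lines - a
hypothetical sketch; ws_splitter is invented here and supplies only the three
operations the algorithm assumes (conversion to bool meaning "a token is
ready", operator* to yield and consume it, operator() to accept one input
symbol):

    #include <iostream>
    #include <iterator>
    #include <string>
    #include <vector>

    // The proposed algorithm, reproduced from the quote above.
    template <typename Tokenizer, typename In, typename Out>
    Tokenizer tokenize( In src_begin, In src_end, Out dst_begin, Tokenizer tok )
    {
        while ( tok )                     // send any prefix tokens
            *dst_begin++ = *tok;
        while ( src_begin != src_end )
        {
            tok( *src_begin++ );          // give one input symbol to the tokenizer
            while ( tok )                 // if a token can now be formed, send it
                *dst_begin++ = *tok;
        }
        return tok;                       // may still hold a partial token
    }

    // An invented whitespace-splitting tokenizer (hypothetical).
    struct ws_splitter
    {
        std::string pending;              // symbols of the token being built
        bool ready;                       // true when a complete token waits

        ws_splitter() : ready( false ) {}

        operator bool() const { return ready; }

        std::string operator*()           // yield the token and consume it
        {
            ready = false;
            std::string t;
            t.swap( pending );
            return t;
        }

        void operator()( char c )         // accept one input symbol
        {
            if ( c == ' ' )
                ready = !pending.empty();
            else
                pending += c;
        }
    };

    int main()
    {
        std::string input = "one two three ";   // trailing space flushes "three"
        std::vector<std::string> tokens;

        // Return value ignored here; it could carry a partial token if the
        // input ended mid-lexeme.
        tokenize( input.begin(), input.end(),
                  std::back_inserter( tokens ), ws_splitter() );

        for ( std::size_t i = 0; i < tokens.size(); ++i )
            std::cout << tokens[i] << '\n';     // prints: one two three
    }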

It's always good to look at something from a different perspective :).
However, I don't think that replacing the token iterator concept with a
'tokenize' algorithm would be beneficial. Actually, I don't think there is a
common generic representation of all sequence-parsing algorithms that we
could factor out and turn into a (useful) 'tokenize' function template. For
instance, the algorithm you posted pretty much rules out backtracking
tokenizers, which may need to re-examine input symbols they have already
consumed. The fact is that iteration through the original input sequence is
too tightly tied to the parsing algorithm, and I don't think there is much
sense in breaking that dependency.

So we shouldn't really care how the input sequence is iterated during
parsing - that's the tokenizer's job. What we want is a standard way to get
at the results of that job, in a form that doesn't impose unnecessary
requirements on users' code and that integrates well with the standard
library itself. IMO, the iterator concept is exactly what we need. It
doesn't force you to process the whole input sequence at once and put the
results somewhere, although you can easily do that if you want to. It
directly supports the common pattern of many (high-level) parsing
algorithms - read a new lexeme -> go to some state -> do some work -> read a
new lexeme - as sketched below. It doesn't make constraining assumptions
about how tokenizers work, and it allows tokenizers to have a minimal
interface (e.g. a tokenizer may be implemented as just a function). As for
the complexity of the current implementation (which concerns me too) - I
hope it will be simplified a lot once we nail down the concepts.
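
To make the pull-based pattern concrete, here is a minimal sketch - not the
TokenIterator interface under discussion; every name below is invented for
illustration:

    #include <iostream>
    #include <string>

    // A toy token source over a string, split on spaces. It models only the
    // property argued for above: each lexeme is produced when the caller
    // asks for it, so the consumer can stop at any point without tokenizing
    // the rest of the input.
    class toy_token_iterator
    {
    public:
        explicit toy_token_iterator( const std::string& s )
            : src_( s ), pos_( 0 ) { advance(); }

        bool done() const { return token_.empty(); }
        const std::string& operator*() const { return token_; }
        toy_token_iterator& operator++() { advance(); return *this; }

    private:
        void advance()
        {
            token_.clear();
            while ( pos_ < src_.size() && src_[pos_] == ' ' )  // skip separators
                ++pos_;
            while ( pos_ < src_.size() && src_[pos_] != ' ' )  // collect lexeme
                token_ += src_[pos_++];
        }

        std::string src_;
        std::string::size_type pos_;
        std::string token_;
    };

    int main()
    {
        std::string input = "if x then y else z";

        // read a new lexeme -> do some work -> read a new lexeme
        for ( toy_token_iterator it( input ); !it.done(); ++it )
        {
            std::cout << *it << '\n';
            if ( *it == "then" )   // stop early: the rest is never tokenized
                break;
        }
    }

A real token iterator would of course model the standard input iterator
requirements (equality comparison, an end-of-sequence value, etc.) so that
it composes with standard algorithms; the sketch only shows the
demand-driven control flow.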

--Aleksey

