
From: jbandela_at_[hidden]
Date: 2001-04-28 12:14:15


Thanks for your input. The semantics of TokenizerFunction are such
that the end argument does not change for a sequence, but the
function is free to modify the begin argument. As for your question
about one call versus ten calls, the answer depends on the data and
the function. If no tokens span the range boundaries, I believe it
should yield the same results with most TokenizerFunctions.
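
To make the semantics concrete, here is a rough sketch of the shape of
such a function object. The comma-splitting logic and the name
simple_separator are invented for this example; it is not the
submitted code:

    #include <string>

    // Sketch of a TokenizerFunction-style functor: "end" is treated as
    // fixed for the whole sequence, while "next" is advanced past
    // whatever was consumed in producing the token.
    struct simple_separator {
        // Returns true and fills tok when a token is found; returns
        // false once the range [next, end) is exhausted.
        template <class Iterator>
        bool operator()(Iterator& next, Iterator end, std::string& tok) const {
            while (next != end && *next == ',') ++next;   // skip separators
            if (next == end) return false;
            tok.clear();
            while (next != end && *next != ',') tok += *next++;
            return true;
        }

        void reset() {}   // no state carried between calls in this sketch
    };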

The reason I chose the pull model is that I believe it integrates
better with the Standard Library and Boost. For example,
token_iterator is just a convenient wrapper for the
boost::token_adapter. In addition, the pull model allows the use of
STL algorithms both in the implementation of the function and in any
further processing of the tokens. Finally, the pull model allows for
incremental parsing, so you do not have to parse the entire sequence
at once. The way it is set up, the TokenizerFunctions do just enough
work to parse the required token, but no more.
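
As an illustration (again an invented sketch, not the submitted code;
comma_separator and its splitting logic are made up for the example),
the function's implementation can lean on std::find, tokens can be
pulled one at a time so parsing stops as soon as you have what you
need, and the results feed directly into standard algorithms:

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <string>
    #include <vector>

    // Hypothetical separator whose implementation itself uses an STL
    // algorithm (std::find) to locate the next delimiter.
    struct comma_separator {
        template <class Iterator>
        bool operator()(Iterator& next, Iterator end, std::string& tok) const {
            if (next == end) return false;
            Iterator stop = std::find(next, end, ',');
            tok.assign(next, stop);
            next = (stop == end) ? end : ++stop;   // step past the separator
            return true;
        }
        void reset() {}
    };

    int main() {
        const std::string data = "alpha,beta,gamma,delta";
        std::string::const_iterator next = data.begin();
        std::string::const_iterator end  = data.end();
        comma_separator sep;
        std::string tok;

        // Incremental pull: only the first token is parsed here; the
        // rest of the sequence has not been scanned yet.
        if (sep(next, end, tok))
            std::cout << "first token: " << tok << '\n';

        // Pull the remaining tokens and hand them to standard
        // algorithms for further processing.
        std::vector<std::string> rest;
        while (sep(next, end, tok))
            rest.push_back(tok);
        std::copy(rest.begin(), rest.end(),
                  std::ostream_iterator<std::string>(std::cout, "|"));
        std::cout << '\n';
    }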

Cheers,

John R. Bandela

--- In boost_at_y..., Doug Gregor <gregod_at_c...> wrote:
>
> I'm a bit unclear as to the semantics of TokenizerFunction.
>
> * Given a TokenizerFunction "tokfn", if I make successive calls to
> tokfn::operator() with no intervening reset, is it implicitly assumed
> that the "end" argument to operator() will always be the same? That
> is, if I take a string of length 100 and tokenize it given the index
> range [0, 100), will I get the same result if I tokenize ten times
> over [0, 10), [10, 20), [20, 30), etc.? A survey of the supplied
> tokenizer functions hints at the answer "no": for instance, the
> csv_separator does not track state between calls to operator().
>
> My concern is that the TokenizerFunction concept essentially dictates
> a "pull" interface which would work well for strings & other
> containers completely stored in memory, but may not be useful for
> asynchronous data obtained from, for instance, a network socket.
>
> In the case of the tokenizer, I believe that a "push" interface would
> be more appropriate. That is, it is seen as a consumer of data that
> has the side effect of producing tokens; alternatively, and probably
> more realistically, it could be viewed as a filter converting a
> stream of characters into a stream of tokens (token_iterator is
> exactly this already, but with specific assumptions on the input
> stream). This would allow asynchronous operation and would not harm
> the interface at all.
>
> I would propose that the tokenizer_iterator's role change a bit.
> Instead of being an input iterator over a sequence of characters, it
> should be an output iterator adaptor. That is, it is an output
> iterator whose value_type is the character type that wraps an output
> iterator whose value_type is the token type.
>
> I opt for this sort of interface because the tokenizer fits well with
> other forms of parsers and scanners, and I think we should take this
> into account. One possibility is to consider a tokenizer_iterator as
> a refinement/model of an "Acceptor" concept, where an Acceptor is a
> refinement of an output iterator that:
>   - has an "accepted" operation to determine if the Acceptor is in a
>     state of acceptance (think of an NFA or PDA). tokenizer_iterator
>     would likely always accept its input.
>   - may have semantic actions at known points in the acceptance
>     operation. A token iterator would emit tokens, a parser would
>     generate nonterminals, etc.
>
> The usage of the token_iterator modification I'm describing is
> similar to the one submitted:
>
>   copy(test_string.begin(), test_string.end(),
>        make_tokenizer(csv_separator(), ostream_iterator<string>(cout, "|")));
>
> Comments?
>
> Doug Gregor

