From: Doug Gregor (gregod_at_[hidden])
Date: 2001-04-28 11:28:09


On Friday 27 April 2001 10:42, you wrote:
> I am requesting a formal review of the Tokenizer Package. You can
> find the code along with relevant docs in the Tokenizer directory of
> Files. It is all compressed in tokenizer.zip
>
>
> Cheers,
>
> John R. Bandela

I'm a bit unclear as to the semantics of TokenizerFunction.

* Given a TokenizerFunction "tokfn", if I make successive calls to
tokfn::operator() with no intervening reset, is it implicitly assumed that
the "end" argument to operator() will always be the same? That is, if I take
a string of length 100 and tokenize it given the index range [0, 100), will I
get the same result if I tokenize ten times over [0, 10), [10, 20), [20, 30),
etc? A survey of the supplied tokenizer functions hints at the answer "no":
for instance, the csv_separator does not track state between calls to
operator().
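
To pin down what I'm asking, here is a sketch of the two call patterns.
The operator() signature assumed below, bool operator()(Iterator& next,
Iterator end, Token& tok), is my reading of the submission, so treat
the names as illustrative:

#include <iostream>
#include <string>

// "TokFn" stands in for any model of TokenizerFunction, with the
// assumed signature bool operator()(Iterator& next, Iterator end,
// Token& tok).
template <typename TokFn>
void tokenize_whole(TokFn tokfn, const std::string& s)
{
    std::string tok;
    std::string::const_iterator next = s.begin();
    while (tokfn(next, s.end(), tok))      // one pass over [0, 100)
        std::cout << tok << '\n';
}

template <typename TokFn>
void tokenize_chunked(TokFn tokfn, const std::string& s)
{
    // Ten passes over [0, 10), [10, 20), ..., with no reset between
    // them; assumes s.size() is a multiple of 10, as in the example.
    std::string tok;
    for (std::string::size_type i = 0; i < s.size(); i += 10) {
        std::string::const_iterator first = s.begin() + i;
        std::string::const_iterator last  = s.begin() + i + 10;
        while (tokfn(first, last, tok))
            std::cout << tok << '\n';
    }
}

My question is whether tokenize_whole and tokenize_chunked are required
to print the same tokens.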

My concern is that the TokenizerFunction concept essentially dictates a
"pull" interface, which works well for strings and other containers stored
entirely in memory, but may not be useful for asynchronous data obtained
from, for instance, a network socket.

In the case of the tokenizer, I believe a "push" interface would be more
appropriate. That is, the tokenizer would be seen as a consumer of data that
has the side effect of producing tokens; alternatively, and probably more
realistically, it could be viewed as a filter converting a stream of
characters into a stream of tokens (token_iterator is exactly this already,
but with specific assumptions about the input stream). This would allow
asynchronous operation and would not harm the interface at all.
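
As a rough sketch of what I mean (nothing here is in the submission;
the class, put(), and finish() are all hypothetical, and the comma
separator is hard-wired only for brevity):

#include <string>

// Hypothetical push-style tokenizer: characters are pushed in as they
// arrive, and each completed token is written through the wrapped
// output iterator as a side effect.
template <typename TokenOutIt>
class push_tokenizer {
public:
    explicit push_tokenizer(TokenOutIt out) : out_(out) {}

    // Consume one character, emitting a token when a separator is seen.
    void put(char c) {
        if (c == ',') { *out_++ = buf_; buf_.clear(); }
        else buf_ += c;
    }

    // Flush the trailing token, e.g. when the connection closes.
    void finish() { if (!buf_.empty()) { *out_++ = buf_; buf_.clear(); } }

private:
    TokenOutIt out_;
    std::string buf_;
};

A socket handler could feed put() whatever chunk happens to arrive, in
pieces of any size, and tokens fall out whenever one completes.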

I would propose that the tokenizer_iterator's role change a bit. Instead of
being an input iterator over a sequence of characters, it should be an
output iterator adaptor: an output iterator whose value_type is the
character type, wrapping an output iterator whose value_type is the token
type.
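
In sketch form (again, every name is hypothetical), this is the push
tokenizer above recast as an output iterator, so it composes with
std::copy and the other algorithms:

#include <iterator>
#include <string>

// Hypothetical adaptor: assigning characters drives tokenization, and
// each completed token is written through to the wrapped output
// iterator. Conceptually the value_type is the character type, per the
// proposal above; the comma separator is hard-wired only for brevity.
template <typename TokenOutIt>
class tokenizer_output_iterator
    : public std::iterator<std::output_iterator_tag, void, void, void, void>
{
public:
    explicit tokenizer_output_iterator(TokenOutIt out) : out_(out) {}

    tokenizer_output_iterator& operator=(char c) {
        if (c == ',') flush();
        else buf_ += c;
        return *this;
    }
    tokenizer_output_iterator& operator*()     { return *this; }
    tokenizer_output_iterator& operator++()    { return *this; }
    tokenizer_output_iterator& operator++(int) { return *this; }

    // Emit whatever is buffered as the final token.
    void finish() { if (!buf_.empty()) flush(); }

private:
    void flush() { *out_++ = buf_; buf_.clear(); }

    TokenOutIt out_;
    std::string buf_;
};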

I opt for this sort of interface because the tokenizer fits well with other
forms of parsers and scanners, and I think we should take this into
account. One possibility is to consider a tokenizer_iterator as a
refinement/model of an "Acceptor" concept, where an Acceptor is a
refinement of an output iterator that (see the sketch after this list):
        - has an "accepted" operation to determine whether the Acceptor is
          in a state of acceptance (think of an NFA or PDA); a
          tokenizer_iterator would likely always accept its input.
        - may have semantic actions at known points in the acceptance
          operation; a token iterator would emit tokens, a parser would
          generate nonterminals, etc.
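
In interface terms, an Acceptor might look something like the following;
this is purely a sketch of the concept, not a worked-out design, and the
class name is illustrative:

// Illustrative model of the Acceptor concept, refining an output
// iterator over characters.
template <typename CharT>
class some_acceptor {
public:
    some_acceptor& operator=(CharT c);  // consume input; semantic
                                        // actions may fire here
    some_acceptor& operator*();
    some_acceptor& operator++();
    some_acceptor& operator++(int);

    // True when the input seen so far leaves the acceptor in an
    // accepting state (think of the accept states of an NFA or PDA).
    bool accepted() const;
};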

The usage of the token_iterator modification I'm describing is similar to
that of the submitted version:
copy(test_string.begin(), test_string.end(),
     make_tokenizer(csv_separator(), ostream_iterator<string>(cout, "|")));
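
Spelled out against the adaptor sketch above (make_tokenizer here is a
hypothetical factory, and my stripped-down adaptor omits the
TokenizerFunction parameter that the real one would take):

#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>

// Hypothetical factory for the tokenizer_output_iterator sketch above.
template <typename TokenOutIt>
tokenizer_output_iterator<TokenOutIt> make_tokenizer(TokenOutIt out)
{
    return tokenizer_output_iterator<TokenOutIt>(out);
}

int main()
{
    std::string test_string = "one,two,three";
    tokenizer_output_iterator<std::ostream_iterator<std::string> > result =
        std::copy(test_string.begin(), test_string.end(),
                  make_tokenizer(std::ostream_iterator<std::string>(std::cout, "|")));
    result.finish();    // flush the last token: prints one|two|three|
    std::cout << std::endl;
    return 0;
}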

Comments?

        Doug Gregor

