From: Daryle Walker (darylew_at_[hidden])
Date: 2000-09-02 00:54:03


I looked at the recent TokenIterator stuff, and I wonder if there is a way
to simplify tokenization. Not every part of a concept has to be a class; we
could replace the token iterator class with an algorithm. How about:

//==========================================================================
template <typename Tokenizer, typename In, typename Out>
Tokenizer tokenize( In src_begin, In src_end, Out dst_begin,
                    Tokenizer tok = Tokenizer() );

template <typename Cap, typename Tokenizer, typename In, typename Out>
Tokenizer tokenize_final( In src_begin, In src_end, Out dst_begin,
Tokenizer tok, Cap capper = Cap() );

//...

template <typename Tokenizer, typename In, typename Out>
Tokenizer
tokenize (
    In src_begin,
    In src_end,
    Out dst_begin,
    Tokenizer tok )
{
    // Send any prefix tokens
    while ( tok )
        *dst_begin++ = *tok;

    while ( src_begin != src_end )
    {
        // Give input symbols to tokenizer
        tok( *src_begin++ );

        // If a token can now be formed, send it
        while ( tok )
            *dst_begin++ = *tok;
    }

    // Return the tokenizer in case more symbols are needed
    return tok;
}

template <typename Cap, typename Tokenizer, typename In, typename Out>
Tokenizer
tokenize_final (
    In src_begin,
    In src_end,
    Out dst_begin,
    Tokenizer tok,
    Cap capper )
{
    // Send any prefix tokens.
    while ( tok )
        *dst_begin++ = *tok;

    while ( src_begin != src_end )
    {
        // Give input symbols to tokenizer.
        tok( *src_begin++ );

        // If a token can now be formed, send it.
        while ( tok )
            *dst_begin++ = *tok;
    }

    // Notify the tokenizer that no more input symbols exist.
    // This lets the tokenizer send any postfix tokens.
    capper( tok );
    while ( tok )
        *dst_begin++ = *tok;

    // Return the tokenizer for any final analyses.
    return tok;
}
//==========================================================================

I allow for the possibility that a tokenizer object may want to send tokens
to the output before or after reading the input symbols. You can skip
tokenize_final and use plain tokenize if there's no post-processing. Since
these are template functions, the types are deduced from the arguments; you
only have to name Tokenizer (or Cap) explicitly when you rely on its default
argument, because a default argument can't be used for deduction.
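
To illustrate the prefix/postfix case, here is a rough sketch rather than a
finished design. BracketTok, Finish, and bracket_demo are hypothetical names
made up for the illustration. The sketch assumes that dereferencing a
tokenizer also removes the token it returns (otherwise the "while ( tok )"
loops would never end), and it uses a C++11 explicit operator bool for the
Boolean conversion.

```cpp
#include <cassert>
#include <deque>
#include <iterator>
#include <string>
#include <vector>

// Same algorithm as tokenize_final above, repeated so the sketch compiles alone.
template <typename Cap, typename Tokenizer, typename In, typename Out>
Tokenizer tokenize_final( In src_begin, In src_end, Out dst_begin,
                          Tokenizer tok, Cap capper = Cap() )
{
    while ( tok )
        *dst_begin++ = *tok;
    while ( src_begin != src_end )
    {
        tok( *src_begin++ );
        while ( tok )
            *dst_begin++ = *tok;
    }
    capper( tok );
    while ( tok )
        *dst_begin++ = *tok;
    return tok;
}

// Hypothetical tokenizer: emits "<" before any input, each input character
// as its own token, and ">" once it is told the input has ended.
class BracketTok
{
public:
    BracketTok() { ready_.push_back( "<" ); }   // prefix token

    // Boolean conversion: true while at least one token is ready.
    explicit operator bool() const { return !ready_.empty(); }

    // Dereference: return the next token AND remove it (assumed semantics).
    std::string operator*()
    {
        assert( !ready_.empty() );
        std::string t = ready_.front();
        ready_.pop_front();
        return t;
    }

    // Accept one input symbol.
    void operator()( char c ) { ready_.push_back( std::string( 1, c ) ); }

    // Called through the capper: queue the postfix token.
    void finish() { ready_.push_back( ">" ); }

private:
    std::deque<std::string> ready_;
};

// Hypothetical capper: tells the tokenizer that the input is exhausted.
struct Finish
{
    void operator()( BracketTok &t ) const { t.finish(); }
};

std::vector<std::string> bracket_demo( std::string const &text )
{
    std::vector<std::string> out;
    // Only Cap is named explicitly, because its argument is defaulted;
    // Tokenizer, In, and Out are deduced from the call.
    tokenize_final<Finish>( text.begin(), text.end(),
                            std::back_inserter( out ), BracketTok() );
    return out;
}
```

Under these assumptions, bracket_demo( "ab" ) collects the prefix token, one
token per character, and then the postfix token that the capper triggers.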

The operations the tokenizer class has to support are:

- A Boolean conversion (bool, const void *, etc.) to indicate when at least
one output token is ready
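- A dereference operation that yields the next token for copying to the
output and removes it from the tokenizer, so that the "while ( tok )" loops
terminate (valid only when the Boolean conversion returns true)
- A function-call operator that accepts and processes the next symbol from
the input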

Should I try to formulate this into a concrete example?
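
As a starting point, such an example might look roughly like the following.
This is only a sketch under my own assumptions: WordTok and words_demo are
hypothetical names, words are separated by single spaces, dereferencing
removes the token it returns, and the Boolean conversion is a C++11 explicit
operator bool. It also shows why tokenize returns the tokenizer: a word cut
off at the end of one chunk is finished by the next chunk.

```cpp
#include <cassert>
#include <deque>
#include <iterator>
#include <string>
#include <vector>

// Same algorithm as tokenize above, repeated so the sketch compiles alone.
template <typename Tokenizer, typename In, typename Out>
Tokenizer tokenize( In src_begin, In src_end, Out dst_begin,
                    Tokenizer tok = Tokenizer() )
{
    while ( tok )
        *dst_begin++ = *tok;
    while ( src_begin != src_end )
    {
        tok( *src_begin++ );
        while ( tok )
            *dst_begin++ = *tok;
    }
    return tok;
}

// Hypothetical tokenizer: collects characters into words and makes a word
// available as a token when the space after it arrives.
class WordTok
{
public:
    // Boolean conversion: true while at least one complete word is ready.
    explicit operator bool() const { return !ready_.empty(); }

    // Dereference: return the next word AND remove it (assumed semantics).
    std::string operator*()
    {
        assert( !ready_.empty() );
        std::string t = ready_.front();
        ready_.pop_front();
        return t;
    }

    // Accept one input symbol.
    void operator()( char c )
    {
        if ( c != ' ' )
            partial_ += c;
        else if ( !partial_.empty() )
        {
            ready_.push_back( partial_ );
            partial_.clear();
        }
    }

private:
    std::string             partial_;  // word still being read
    std::deque<std::string> ready_;    // complete words not yet sent
};

std::vector<std::string> words_demo()
{
    std::string const first = "hello wo";
    std::string const second = "rld ";
    std::vector<std::string> out;

    // Tokenizer is named explicitly because the default argument is used.
    WordTok tok = tokenize<WordTok>( first.begin(), first.end(),
                                     std::back_inserter( out ) );

    // "wo" is still pending inside tok; the second chunk completes it.
    tokenize( second.begin(), second.end(), std::back_inserter( out ), tok );
    return out;
}
```

Note that a last word with no space after it would stay pending inside the
tokenizer; flushing it is the kind of post-processing tokenize_final and its
capper are meant for.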
