From: jbandela_at_[hidden]
Date: 2001-10-18 12:56:57

I do not see how we can make the size 12.
If I understand your post, each token_iterator would have
1) TokenizerFunction* (4 bytes)
2) Token (8 bytes if the iterator pair representation is used)

However, I think you also need to have the iterator that represents
the current position, because you can have multiple iterators active
over a given sequence.

I propose the following
1) tokenizer*
2) current_pos iterator
3) Token
On 32-bit implementations, this would give a size of 8 + sizeof(Token),
which would be 16 for token_representation.

Both the end iterator and the TokenizerFunction can be retrieved from
the tokenizer.
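
A rough sketch of the three-member layout proposed above, for illustration
only. All of the names ending in _sketch are hypothetical and are not part
of the tokenizer library; the sketch assumes the underlying iterator is a
plain const char* and that the token is stored as an iterator pair.

```cpp
// Hypothetical stand-in for the iterator-pair token representation.
template <class Iterator>
struct token_representation_sketch {
    Iterator first;  // start of the token
    Iterator last;   // one past the end of the token
};

// Hypothetical tokenizer: owns the function object and the sequence bounds,
// so the iterators do not have to carry them around.
template <class TokenizerFunction, class Iterator>
struct tokenizer_sketch {
    TokenizerFunction func;
    Iterator first;
    Iterator last;
};

// Hypothetical iterator with exactly the three proposed members.
template <class TokenizerFunction, class Iterator,
          class Token = token_representation_sketch<Iterator> >
struct token_iterator_sketch {
    const tokenizer_sketch<TokenizerFunction, Iterator>* owner;  // 1) tokenizer*
    Iterator current_pos;                                        // 2) current position
    Token token;                                                 // 3) Token

    // The end iterator and the function are fetched from the tokenizer.
    Iterator end() const { return owner->last; }
    const TokenizerFunction& function() const { return owner->func; }
};

// Hypothetical stateless function object, used only to instantiate the sketch.
struct char_delimiter_sketch { char delim; };

typedef token_iterator_sketch<char_delimiter_sketch, const char*> sketch_iterator;

// Two words plus the token: 8 + 8 = 16 bytes with 4-byte pointers.
static_assert(sizeof(sketch_iterator) ==
                  2 * sizeof(void*) +
                  sizeof(token_representation_sketch<const char*>),
              "iterator is two pointers plus the token");
```

With 4-byte pointers this comes to the 16 bytes mentioned above; with
8-byte pointers the same layout is 32 bytes.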

With regard to state information, instead of specializing templates,
make each TokenizerFunction typedef a state_type. If it has no
non-constant state, it would typedef to EmptyState, which would be
an empty class. The tokenizer_base class would then inherit from the
state_type. This would allow the empty base optimization.
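
A minimal sketch of the state_type idea, under the assumption that the
base class holds a function pointer and a current position. EmptyState and
state_type are the names used above; everything ending in _sketch, and
offset_state, are hypothetical names made up for this example.

```cpp
#include <cstddef>

// Empty class used by functions that have no non-constant state.
struct EmptyState {};

// Hypothetical stateless function: its state_type is EmptyState.
struct char_delimiter_sketch {
    typedef EmptyState state_type;
    char delim;
};

// Hypothetical stateful function: it needs a running index between calls.
struct offset_state { std::size_t index; };
struct offset_function_sketch {
    typedef offset_state state_type;
};

// Hypothetical base class that inherits from the function's state_type.
// When state_type is empty, the base subobject takes up no space.
template <class TokenizerFunction, class Iterator>
struct tokenizer_base_sketch : private TokenizerFunction::state_type {
    typedef typename TokenizerFunction::state_type state_type;

    const TokenizerFunction* func;
    Iterator current_pos;

    // Access to the per-iterator state, stored in the base subobject.
    state_type& state() { return *this; }
};

typedef tokenizer_base_sketch<char_delimiter_sketch, const char*> stateless_base;
typedef tokenizer_base_sketch<offset_function_sketch, const char*> stateful_base;

// The empty base adds nothing: the stateless case is just two pointers.
static_assert(sizeof(stateless_base) == 2 * sizeof(void*),
              "EmptyState base takes no space");
// A function with real state pays for it, and only then.
static_assert(sizeof(stateful_base) > sizeof(stateless_base),
              "stateful case is larger");
```

No template specialization is needed here: the same class template covers
both cases, and the size difference falls out of the inheritance.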

Also, about token_rep, it will not work for escaped_list_separator nor
will it work when the underlying_iterator is an input iterator.
However, even with the current implementation it is still possible to
use such a class by specifying it as the third template parameter to


John R. Bandela

> The more I think about it, the more that this seems like the right
> thing to do. But, this alone doesn't solve the problem of big
> iterators. I think what should happen is that the TokenizerFunction
> be split up into the function proper (including static/closure-like
> state), and other state which is needed for the function. I guess
> what I would be getting to would be a layout of a token_iterator which
> looks like this:
> typedef token_representation<underlying_iterator> value_type;
> value_type token;
> compressed_pair<TokenizerFunction *, State<TokenizerFunction> >;
> Where we give token_representation a fully-featured suite of
> string-like assignments, conversions, etc., but which contains just the
> begin and end iterators and has a lifetime limited by the original parsed
> string (this substring class would probably be useful in other
> contexts, actually (regex::match_results comes to mind); it's a lot
> like the Range representations I've seen talked about too). The
> original TokenizerFunction object referred to is in the tokenizer
> object (and now can itself hold the start- and end-iterator of the
> original sequence, which is also currently being carried around by
> each iterator, if I'm reading things correctly).
> Then we specialize things for the default case, where
> State<TokenizerFunction> is void, and otherwise, operator++
> calls something like this:
> token = TokenizerFunction(token.end,State<TokenizerFunction> &);
> and the end token_iterator has start == the TokenizerFunction end of
> the original sequence.
> This would be a size 12 object for non-stateful iterators on most
> implementations, which is much more palatable than the current 40 (on
> my implementation) or 44 on yours; and the copying is just three
> word-copies, so it will be pretty cheap, no ref-counting or anything to
> worry about. As an added bonus, this representation would enable the
> offset_separators function to dispense with true state, as the
> TokenizerFunction could just compute the difference from the
> stored-one-time value in the tokenizer object.
> Now we have ForwardIterator status without any real trouble. We could
> even consider allowing the user to define other functions (eg,
> backwards) if they wanted the iterator to satisfy Bidirectional or
> RandomAccess Iterator requirements, and the tokenizer function in
> question was suitable (eg, offset_separator could very naturally have an
> iterator which satisfies RandomAccess requirements).
> So, thoughts? Is this worth hacking together an attempt at an
> implementation?
> George Heintzelman
> georgeh_at_a...
