
From: todor_at_[hidden]
Date: 2001-10-18 08:40:10


From: "George A. Heintzelman" <georgeh_at_[hidden]>

> I was spending some time staring at my profiling results, and found
> something a little bit disturbing in the tokenizer.
>
> The token_iterators ultimately contain a copy of the policies object,
> which contains the function which decides where the tokens begin and
> end. Some manner of reference to the function is necessary, of course.
> However, the iterator contains the policy by value. The policy is
> allowed to have state.
>
> However, there is no distinguishing between the aspects of the policy
> which are constant and unchanging, as with the strings used to hold the
> separator lists in char_delimiters_separator and escaped_list_separator,
> and the aspects which really are stateful, as with the current_offset
> in the offset_separator.
>
> So, that means that whenever I copy a token_iterator, I wind up making
> two unnecessary string copies.
>
> Most of the STL algorithms take iterators by value, sometimes
> recursively, assuming that iterators are cheap to copy....
>
>

I've had the same concerns. I've found that on my implementation
sizeof(token_iterator<char_delimiters_separator>) == 44. This
is really gross.

I've ended up with a solution that has a low copying cost, at
the expense of "degrading" the token iterator from a
ForwardIterator to an InputIterator. The "degradation" occurs
because the tokenizer function instance and the current token
are kept in a state object that is shared between iterator
instances. As usual, it is the "additional level of indirection"
approach (original code slightly edited for brevity):

#include <cassert>
#include <iterator>

template <
 class TokenizerFunc,
 class Iterator,
 class Type
>
class token_stream : TokenizerFunc
{
public:
 struct iterator;
 friend struct iterator;
 struct iterator : std::iterator<std::input_iterator_tag, Type>
 {
  token_stream* state;

  iterator():state(0)
  {}
  iterator(token_stream* p)
   :state(p)
  {}

  Type const& operator*() const
  {
   assert(state);
   return state->dereference();
  }

  Type const* operator->() const
  {
   assert(state);
   return &state->dereference();
  }

  iterator& operator++()
  {
   assert(state);
   if(!state->increment())
    state = 0;
   return *this;
  }

  iterator operator++(int)
  {
   assert(state);
   iterator i(*this);
   operator++();
   return i;
  }

  bool operator==(iterator rhs) const
  {
   // Two non-null iterators are comparable only if they refer to
   // the same stream.
   assert(!state || !rhs.state || state == rhs.state);
   return state == rhs.state;
  }

  bool operator!=(iterator rhs) const
  {
   return !operator==(rhs);
  }
 };

 token_stream(
    Iterator first,
    Iterator last,
    TokenizerFunc const& f = TokenizerFunc())
  :TokenizerFunc(f) // base listed first to match actual initialization order
  ,m_begin(first)
  ,m_end(last)
 {
  initialize();
 }

 template<class Container>
 token_stream(
    Container const& c)
  :TokenizerFunc()
  ,m_begin(c.begin())
  ,m_end(c.end())
 {
  initialize();
 }

 template<class Container>
 token_stream(
   Container const& c,
   TokenizerFunc const& f)
  :TokenizerFunc(f)
  ,m_begin(c.begin())
  ,m_end(c.end())
 {
  initialize();
 }

 iterator begin()
 {
  if(m_valid)
   return iterator(this);
  else
   return iterator();
 }

 iterator end()
 {
  return iterator();
 }

 TokenizerFunc& func()
 {
  return static_cast<TokenizerFunc&>(*this);
 }

private:
 Iterator m_begin;
 Iterator m_end;
 bool m_valid;
 Type m_tok;

 void initialize()
 {
  func().reset();
  m_valid = (m_begin != m_end)? func()(m_begin, m_end, m_tok) : false;
 }

 bool increment()
 {
  assert(m_valid);
  return m_valid = func()(m_begin, m_end, m_tok);
 }

 Type const& dereference() const
 {
  assert(m_valid);
  return m_tok;
 }
};

I've had no problems with this approach. Of course, there are
some lifetime issues (the lifetime of the token_stream must
exceed the lifetime of its iterators), but the situation is
the same with any container. The only real problem is that the
iterators are no longer ForwardIterators. I *think* that this
problem can be solved by redesigning the library to
distinguish between const and mutable tokenization state, but
I have to admit that I haven't put much thought into this direction.

Thinking a bit more: in what usage scenarios will token iterators
be passed to algorithms that require ForwardIterators? Does
anyone have some examples? I haven't run into such cases so far.

Todor


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk