From: Robert Zeh (razeh_at_[hidden])
Date: 2003-11-04 17:31:16


We use boost::tokenizer extensively in our internal architecture to
parse incoming messages. Some of the messages are grouped into blocks
of about 1000 characters, and we use the tokenizer to break them out
into std::basic_strings of about 100 characters.

Under gcc 3.3.2 on Solaris 2.8, one of our developers, John Flanagan,
found that building up tokens a character at a time (with operator +=)
is slower than constructing each token from an iterator range.

The following specialization of operator()
(InputIterator,InputIterator, Token) in char_separator provides a
nice speedup:

    template <typename InputIterator>
    bool operator()(InputIterator& next, InputIterator end, string_type& tok)
    {
      tok = string_type();

      // skip past all dropped_delims
      if (m_empty_tokens == drop_empty_tokens)
        for (; next != end && is_dropped(*next); ++next)
          { }
      
      if (m_empty_tokens == drop_empty_tokens) {

        if (next == end)
          return false;

        // if we are on a kept_delims move past it and stop
        if (is_kept(*next)) {
          // tok += *next;
          tok = string_type(1, *next);
          ++next;
        } else {
          // append all the non delim characters
          InputIterator start(next);
          for (; next != end &&
                 !is_dropped(*next) && !is_kept(*next); ++next)
           // tok += *next;
            ;
          tok = string_type(start, next);
        }
      }
      else { // m_empty_tokens == keep_empty_tokens
        
        // Handle empty token at the end
        if (next == end) {
          if (m_output_done == false) {
            m_output_done = true;
            return true;
          } else
            return false;
        }

        if (is_kept(*next)) {
          if (m_output_done == false)
            m_output_done = true;
          else {
            // tok += *next;
            tok = string_type(1, *next);
            ++next;
            m_output_done = false;
          }
        }
        else if (m_output_done == false && is_dropped(*next)) {
          m_output_done = true;
        }
        else {
          if (is_dropped(*next))
            ++next;
          InputIterator start(next);
          for (; next != end && !is_dropped(*next) && !is_kept(*next); ++next)
            // tok += *next;
            ;
          tok = string_type(start, next);
          m_output_done = true;
        }
      }
      return true;
    }

I believe that the specialization may not be quite complete, and that
the iterators might need to be specialized along with the Token type.
If this modification is of interest, I am hoping that someone can let
me know whether the iterators need to be specialized as well.

We have done some timing experiments with our proprietary code that
show very large speedups. Creating the string tokens with an
iterator range rather than operator += reduces the time spent in the
tokenizer to almost nothing.

For public use I've also done a very small timing experiment with
char_sep_example_3.cpp. For timing purposes I've removed the output
operations and forced it to tokenize the same string 1,000,000 times.

Without the specialization my test program takes 46 seconds to
complete. With the specialization it takes 33 seconds, a reduction of
roughly 28%. I'm using gcc 3.3.2 with -O3 on a 500 MHz UltraSPARC.
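
For anyone who wants to repeat this kind of measurement, the following is a
minimal sketch of a repeat-and-time loop. It is not the program used for
the numbers above. `time_repeated` and `count_fields` are hypothetical
names, the workload is a stand-in rather than boost::tokenizer, and
std::chrono assumes a C++11 compiler.

```cpp
#include <chrono>
#include <cstddef>
#include <string>

// Hypothetical helper: runs `work` the given number of times and returns
// the elapsed wall-clock time in seconds.
template <typename Work>
double time_repeated(Work work, std::size_t iterations)
{
  typedef std::chrono::steady_clock clock;
  const clock::time_point begin = clock::now();
  for (std::size_t i = 0; i < iterations; ++i)
    work();
  const std::chrono::duration<double> elapsed = clock::now() - begin;
  return elapsed.count();
}

// Stand-in workload: counts comma-separated fields in *text and adds the
// count to *total, so the work has an observable result and produces no
// output inside the timed loop.
struct count_fields {
  const std::string* text;
  std::size_t* total;

  void operator()() const
  {
    std::size_t fields = 1;
    for (std::string::const_iterator it = text->begin();
         it != text->end(); ++it)
      if (*it == ',')
        ++fields;
    *total += fields;
  }
};
```

To compare two tokenizing strategies, wrap each in its own function object
and pass both to `time_repeated` with the same input and iteration count.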

Robert Zeh

