Boost logo

Boost :

From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2008-02-14 12:50:10


Joe Gottman wrote:
> I just discovered that the string algorithm predicate is_any_of has an
> inefficient implementation. Internally it uses a std::set.

It would be nice if things like this could be at least as efficient as
the equivalent C functionality. I've just done some quick benchmarks
of possible is_any_of implementations. My test splits my
/usr/share/dict/words at every lower-case vowel; it's about a megabyte
and splits into 300,000 pieces.

First using std::set:

struct stdset_is_any_of {
   std::set<char> s;
   stdset_is_any_of(string pat) {
     for(string::const_iterator i=pat.begin(); i!=pat.end(); ++i) {
       s.insert(*i);
     }
   }
   bool operator()(char c) {
     return s.find(c) != s.end();
   }
};

Relative performance is 0.6.

For small search sets, just storing the string and looking though it
each time should be faster. This is of course linear in the search set
size, rather than logarithmic:

struct str_is_any_of {
   string pat;
   str_is_any_of(string pat_): pat(pat_) {}
   bool operator()(char c) {
     return pat.find(c) != string::npos;
   }
};

Relative performance is 3.1. Using strchr on pat.c_str() is a bit
faster at 3.4.

For larger search sets another possibility, for 8-bit characters, is to
use a bit-set:

struct bitset_is_any_of {
   uint32_t bits[8];
   bitset_is_any_of(string pat) {
     fill(bits,bits+8,0);
     for(string::const_iterator i=pat.begin(); i!=pat.end(); ++i) {
       char c = *i;
       bits[c>>5] |= (1<<(c&0x1f));
     }
   }
   bool operator()(char c) {
     return bits[c>>5] & (1<<(c&0x1f));
   }
};

In practice this seems to work well even for small search sets like my
aeiou; its relative performance is 4.0.

Further investigation with other workloads suggests that this bitset
really works very well, and I note that it would be possible to write a
specialisation of std::set<char> using it. This is probably worth pursuing.

In each case, I've used the predicate with std::find_if. But if I
instead use C's strcspn() to do the whole job, I see a relative
performance for this example of 20. I find it depressing that the C++
code doesn't come close to that.

I think that every time I've used is_any_of or similar features the set
of characters has been a compile-time constant. So is it possible to
make the characters a template parameter, e.g. is_any_of<"ABC">? Sadly
it seems that strings can't be template parameters (though apparently
pointers can be; what is that used for?). But you can write this:

template <char c0> bool is_any_of(char c) {
   return (c==c0);
}

template <char c0, char c1> bool is_any_of(char c) {
   return (c==c0) || is_any_of<c1>(c);
}

template <char c0, char c1, char c2> bool is_any_of(char c) {
   return (c==c0) || is_any_of<c1,c2>(c);
}

// etc.

if (is_any_of<'a','b'>(ch)) ....

Presumably C++0x varadic templates make that simpler.

Using this sort of technique with std::find_if I get a relative
performance for my benchmark of 28 - finally better than C.

These numbers are all with gcc 4.1 -O3 on Linux. I hope the above is
of interest to someone; having done the work it would a shame not to
share the results. Joe, I guess that in your case the overhead of
copying the predicate and constructing the result sequence will
dominate over the performance of the predicate itself.

Cheers, Phil.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk