Boost logo

Boost :

From: Giovanni Piero Deretta (gpderetta_at_[hidden])
Date: 2008-02-15 06:46:01


On Thu, Feb 14, 2008 at 6:50 PM, Phil Endecott
<spam_from_boost_dev_at_[hidden]> wrote:
> Joe Gottman wrote:
> > I just discovered that the string algorithm predicate is_any_of has an
> > inefficient implementation. Internally it uses a std::set.
>
> It would be nice if things like this could be at least as efficient as
> the equivalent C functionality. I've just done some quick benchmarks
> of possible is_any_of implementations. My test splits my
> /usr/share/dict/words at every lower-case vowel; it's about a megabyte
> and splits into 300,000 pieces.
>
> First using std::set:
> [...]
>
> Relative performance is 0.6.
>
> For small search sets, just storing the string and looking though it
> each time should be faster. This is of course linear in the search set
> size, rather than logarithmic:
> [...]
> Relative performance is 3.1. Using strchr on pat.c_str() is a bit
> faster at 3.4.
>
> For larger search sets another possibility, for 8-bit characters, is to
> use a bit-set:
> [...]
> In practice this seems to work well even for small search sets like my
> aeiou; its relative performance is 4.0.
>
> Further investigation with other workloads suggests that this bitset
> really works very well, and I note that it would be possible to write a
> specialisation of std::set<char> using it. This is probably worth pursuing.

Is it legal to specialize an std class for a non UDT?

>
> In each case, I've used the predicate with std::find_if. But if I
> instead use C's strcspn() to do the whole job, I see a relative
> performance for this example of 20. I find it depressing that the C++
> code doesn't come close to that.

Well, it is not really a fair comparison, the 'C' function is likely
implemented in hand optimized assembler, (http://tinyurl.com/yv8s7g)
and for such a small function, humans are still better than compilers.
 So it isn't really C++ versus C. Also the C version is not generic at
all.

>
> I think that every time I've used is_any_of or similar features the set
> of characters has been a compile-time constant. So is it possible to
> make the characters a template parameter, e.g. is_any_of<"ABC">? Sadly
> it seems that strings can't be template parameters (though apparently
> pointers can be; what is that used for?). But you can write this:
> [...]
> Presumably C++0x varadic templates make that simpler.
>
> Using this sort of technique with std::find_if I get a relative
> performance for my benchmark of 28 - finally better than C.

Nice!

-- gpd


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk