Boost logo

Boost :

Subject: Re: [boost] [UTF String] Feedback on UTF String library, please
From: Brent Spillner (spillner_at_[hidden])
Date: 2011-02-14 07:16:14


On 2/11/2001 17:05:25, Phil Endecott wrote:
>Brent Spillner wrote:
>>>Just when is it really valid to want to jump the 278th "abstract
>>>character" in a string?
>>>
>>>Seriously, how often do these situations arise?
>>
>> Well, Boyer-Moore string searching, obviously.
>
>If you're doing Boyer-Moore on a UTF-8 string you should normally do it
>on the bytes, not on the codepoints, so there is no need for random
>access to the codepoints.  This works because of the particular
>properties of UTF-8, and will be faster because you don't have the
>overhead of the UTF-8 decoding.

The discussion was about 'abstract characters,' not codepoints. I
think a codepoint iterator is likely useful too, but for most
applications the character view is required.

>The exception would be if you want to do comparison at the "grapheme
>cluster" level, i.e. if you want to consider a single-codepoint,
>multi-byte representation of e.g. an accented character as equal to the
>multiple-codepoint, multi-byte representation comprising a base

Yes, I think that's the behavior desired for most applications.

>character plus a combining accent.  I would suggest that it would be
>better to first normalise your strings, so that this doesn't arise.

...but I think that's a little too glib. If you have the string
"résumé" in your data store, and a user enters "resume" as a search
key, then you either have to search independently for each plausible
variant for "resume," or you have to maintain a parallel, accent-free
copy of the original text and the means to translate indices into the
accent-free version into indices into the rich version, or you have to
have some notion of 'abstract character' comparison and iteration. I
think the last solution is by far the most preferable, particularly
for a low-level library. This is how people actually use strings in
practice, after all.

>Even if you did want to do Boyer-Moore on the codepoint-sequence, I
>think that using std::advance to move a bidirectional iterator would
>not change its complexity order.

It clearly won't; you need to ::advance at most O(n) times, and
Boyer-Moore is already a linear complexity algorithm (and I'd be
amazed to see a sublinear one for the general case.) But since the
whole algorithm is linear, those extra O(n) iterator advancements past
code units that you already know to be irrelevant could be a
measurable contributor to the total run time.

>Are you suggesting some sort of indexed string?  I guess you could use
>an augmented tree to provide O(log N) access to the N'th character in a
>UTF-8 string.  In fact didn't someone propose something like that a few

Yes, that's what I was thinking.

>months ago?  Not as a string but as a "vector whose iterators are not
>invalided" or "list with random access" or something?

Good to know; I'll take a look at that.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk