Boost logo

Boost :

Subject: Re: [boost] [UTF String] Feedback on UTF String library, please
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2011-02-14 10:54:01


Brent Spillner wrote:
> If you have the string "r?sum?" in your data store,

[Aside: I'm reading this via the "daily digest", and I find it sadly
appropriate that the mailing list software replaces all non-ASCII
characters with ?s...]

> and a user enters "resume" as a search
> key, then you either have to search independently for each plausible
> variant for "resume," or you have to maintain a parallel, accent-free
> copy of the original text and the means to translate indices into the
> accent-free version into indices into the rich version, or you have to
> have some notion of 'abstract character' comparison and iteration. I
> think the last solution is by far the most preferable, particularly
> for a low-level library. This is how people actually use strings in
> practice, after all.

Well, "how people actually use" and "for most applications" are things
that could be argued about ad infinitum. Let's not go there. Yes,
approximate-matching is useful. I have a to_ascii_character iterator
adaptor (or to_alphanumeric_character) that I can layer on top of a
utf8 iterator when I need that. More often, I just need a comparison
that will work for keys in a std::associative_container.

>>Even if you did want to do Boyer-Moore on the codepoint-sequence, I
>>think that using std::advance to move a bidirectional iterator would
>>not change its complexity order.
>
> It clearly won't; you need to ::advance at most O(n) times, and
> Boyer-Moore is already a linear complexity algorithm (and I'd be
> amazed to see a sublinear one for the general case.) But since the
> whole algorithm is linear, those extra O(n) iterator advancements past
> code units that you already know to be irrelevant could be a
> measurable contributor to the total run time.

But what are the options we're comparing here? I don't think there's
anything on the table that gives you O(1) random access to the
characters or codepoints of a UTF-8 string. Chad's proposal has an
iterator that gives you that only when the string doesn't contain any
non-ASCII characters, and needs to scan the string in O(N) time to
determine that; in general, it will emulate random access in linear
time. The hypothetical augmented tree representation can do random
access in log(N) time, but you still have to build the tree first which
is O(N). Using std::advance on a bidirectional iterator is probably
better than those alternatives.

If I were trying to implement fast Boyer-Moore on UTF-8, I would do it
on the bytes: the distance that you advance after a mismatch can be
thought of as a hint; if it tells you to advance N characters, I would
advance N bytes and then skip to the next codepoint-start byte. This
is, in my experience, typical of how UTF-8 can be most efficiently
used; if you don't forget that it's a byte sequence, you can do smart
things with it.

Regards, Phil.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk