Subject: Re: [boost] [UTF String] Feedback on UTF String library, please
From: Marsh Ray (marsh_at_[hidden])
Date: 2011-02-11 15:09:27


On 02/11/2011 11:44 AM, Chad Nelson wrote:
> On Fri, 11 Feb 2011 17:22:50 +0000
> "Phil Endecott"<spam_from_boost_dev_at_[hidden]> wrote:
>
>> For example, you have this "almost" random-access feature that, IIUC,
>> for UTF-8 will give you O(1) random access if you have only ASCII
>> characters and for UTF-16 will give you O(1) random access if you
>> have only BMP characters. That's just horrible! [...]
>
> If you put it that way, you're right. I assumed that the developer
> using the library would read the documentation and know that the
> iterators weren't always true random-access, but that assumption
> doesn't stand up to conscious examination.

We've heard this argument against UTF-8 many times. Like many of us,
I've worked with a lot of code to process a lot of text over many years.
I'd like to question this idea that random access to arbitrary character
data is really very relevant.

The difference between O(1) and O(N) isn't significant until N becomes
nontrivial, which in practical terms probably means dozens or hundreds
of characters.
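
To make the O(N) case concrete, here is a minimal sketch of what "jump to
the n-th character" costs in UTF-8. It is not taken from the library under
review; the function name nth_code_point_offset is hypothetical, and the
code assumes well-formed UTF-8 and counts code points, not
user-perceived characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, for illustration only.
// Returns the byte offset of the n-th code point (0-based) in s,
// or s.size() if s holds n or fewer code points.
// Assumes s is well-formed UTF-8; no validation is performed.
std::size_t nth_code_point_offset(const std::string& s, std::size_t n)
{
    std::size_t i = 0;
    while (i < s.size()) {
        if (n == 0) {
            return i;  // i sits on the lead byte of the wanted code point
        }
        --n;
        ++i;  // step over the lead byte
        // Step over continuation bytes, which have the bit pattern 10xxxxxx.
        while (i < s.size() &&
               (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) {
            ++i;
        }
    }
    return s.size();
}
```

The scan touches each byte before the target once, so for N in the dozens
or hundreds the work typically stays within a few cache lines.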

So let me ask the question:

Just when is it really valid to want to jump to the 278th "abstract
character" in a string?

Seriously, how often do these situations arise?

A guy who's only ever programmed "US ASCII" on a plain text terminal may
think he needs every 80th character in reverse order to get a column
from a screen line or something. But he would be wrong anywhere that
uses control characters, combining characters, non-spacing blanks,
multibyte encodings, or whatever.
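
As a small illustration of why the index of a character does not give a
screen column, the sketch below counts code points in a UTF-8 string. The
name count_code_points is hypothetical, and the code assumes well-formed
UTF-8. An "e" followed by a combining acute accent (U+0301) is three
bytes and two code points, yet it normally occupies one column.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, for illustration only.
// Counts code points in s by counting every byte that is not a
// UTF-8 continuation byte (bit pattern 10xxxxxx).
// Assumes s is well-formed UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For "e\xCC\x81" this gives 2 while the string's size() is 3, so neither
number matches the single column a terminal would usually draw.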

Some string search and regex algorithms use skip-ahead N, but how often
is N large enough to avoid a whole cache line fill?

Isn't it sufficient to simply document the behavior that derives from a
straightforward implementation of the API?

- Marsh

