Subject: Re: [boost] [unicode] Interest Check / Proof of Concept
From: Zach Laine (whatwasthataddress_at_[hidden])
Date: 2008-11-20 16:05:48


> Eric Niebler wrote:
>>
>> Zach Laine wrote:
>>>>
>>>> Over the past few months, I've been tinkering with a Unicode string
>>>> library. It's still *far* from finished, but it's far enough along that
>>>> the overall structure is visible. I've seen a bunch of Unicode proposals
>>>> for Boost come and go, so hopefully this one will address the most
>>>> common needs people have.
>>>
>>> I would love to see a Unicode support library added to Boost.
>>> However, I question the usefulness of another string class, or in this
>>> case another hierarchy of string classes. Interoperability with
>>> std::string (and QString, and CString, and a thousand other
>>> API-specific string classes) is always thorny. I'd much rather see an
>>> iterators- and algorithms-based approach
>>
>> <snip>
>>
>> Agree. Thanks Zach. I'm discouraged that every time the issue of a Unicode
>> library comes up, the discussion immediately descends into a debate about
>> how to design yet another string class. Such a high level wrapper *might* be
>> useful (strong emphasis on "might"), but the core must be the Unicode
>> algorithms, and the design for a Unicode library must start there.
>
> I mostly agree. If people want UTF-8 and UTF-16 iterator adaptors that will
> efficiently convert byte-sequence iterators into Unicode character
> iterators, then I probably already have exactly that. Should I package it
> up for review?

Perhaps joining forces with Jim Porter would produce more interesting
results than either of you would produce in isolation.
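
For anyone unfamiliar with the idea, a UTF-8 iterator adaptor of the kind
mentioned above could look roughly like the sketch below. This is not code
from either library under discussion. The name utf8_decode_iterator is
made up for illustration, and the sketch assumes well-formed UTF-8 input
(no checks for truncated sequences, overlong forms, or surrogates) and a
compiler that provides char32_t.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical adaptor: wraps a byte iterator and yields one code point per
// dereference. Assumes well-formed UTF-8; performs no validation.
template <typename ByteIter>
class utf8_decode_iterator {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef char32_t                value_type;
    typedef std::ptrdiff_t          difference_type;
    typedef const char32_t*         pointer;
    typedef char32_t                reference;  // returned by value

    utf8_decode_iterator(ByteIter pos, ByteIter last) : pos_(pos), last_(last) {}

    // Decode the code point that starts at the current byte position.
    char32_t operator*() const {
        ByteIter it = pos_;
        assert(it != last_);
        unsigned char lead = static_cast<unsigned char>(*it);
        ++it;
        char32_t cp = lead;   // single-byte (ASCII) case
        int trailing = 0;
        if (lead >= 0xF0)      { cp = lead & 0x07; trailing = 3; }
        else if (lead >= 0xE0) { cp = lead & 0x0F; trailing = 2; }
        else if (lead >= 0xC0) { cp = lead & 0x1F; trailing = 1; }
        // Each continuation byte contributes its low six bits.
        for (; trailing > 0 && it != last_; --trailing, ++it) {
            cp = (cp << 6) | (static_cast<unsigned char>(*it) & 0x3F);
        }
        return cp;
    }

    // Step over the lead byte and any continuation bytes (10xxxxxx).
    utf8_decode_iterator& operator++() {
        ++pos_;
        while (pos_ != last_ &&
               (static_cast<unsigned char>(*pos_) & 0xC0) == 0x80) {
            ++pos_;
        }
        return *this;
    }

    utf8_decode_iterator operator++(int) {
        utf8_decode_iterator tmp = *this;
        ++*this;
        return tmp;
    }

    friend bool operator==(const utf8_decode_iterator& a,
                           const utf8_decode_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_decode_iterator& a,
                           const utf8_decode_iterator& b) {
        return !(a == b);
    }

private:
    ByteIter pos_;
    ByteIter last_;
};
```

Because the adaptor only needs a pair of byte iterators, it works the same
over std::string, a char array, or any other container of bytes, which is
the interoperability argument made earlier in the thread.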

> There are, however, a few points to consider. Most importantly, if you
> operate on a UTF-8 string only using an iterator-adaptor then you'll miss
> out on most of the clever features of the encoding. Specifically:
>
> - If you need to search for an ASCII character in a UTF-8 string then you
> can do so just by scanning the bytes.
> - Similarly, searching for substrings (including substrings with non-ASCII
> characters) can be done just by scanning for a bytewise match.
> - Sorting can be done using strcmp()-like comparisons on the byte sequences.
>
> An implementation that doesn't somehow exploit these optimisations will
> perform sub-optimally, and I don't think that would be acceptable.

These are all good and worthy things to have in the UTF-8 portion of a
Unicode library. Note that they all describe algorithms. My
suggestion is for a library based on iterators and algorithms instead
of based on a hierarchy of string classes.
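
As a rough illustration of the three points quoted above, each one can be
written as a free function over plain byte iterators, with no string class
involved. The function names below are hypothetical, and the sketch assumes
that all inputs are well-formed UTF-8. Note that the comparison gives code
point order only, not locale-aware collation.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Find an ASCII character by scanning bytes. This is safe because byte
// values below 0x80 never appear inside a multi-byte UTF-8 sequence.
template <typename ByteIter>
ByteIter find_ascii(ByteIter first, ByteIter last, char ascii)
{
    assert(static_cast<unsigned char>(ascii) < 0x80);
    return std::find(first, last, ascii);
}

// Find a UTF-8 substring with a plain bytewise search. Lead bytes and
// continuation bytes use disjoint value ranges, so a well-formed needle
// cannot match starting in the middle of a character.
template <typename Iter1, typename Iter2>
Iter1 find_utf8(Iter1 first, Iter1 last, Iter2 needle_first, Iter2 needle_last)
{
    return std::search(first, last, needle_first, needle_last);
}

// strcmp()-like ordering: comparing bytes as unsigned values orders
// well-formed UTF-8 strings by code point.
template <typename Iter1, typename Iter2>
bool utf8_less(Iter1 first1, Iter1 last1, Iter2 first2, Iter2 last2)
{
    return std::lexicographical_compare(
        first1, last1, first2, last2,
        [](char a, char b) {
            return static_cast<unsigned char>(a) < static_cast<unsigned char>(b);
        });
}
```

None of these decode anything, so they run at the speed of the underlying
byte search, and they accept iterators from std::string or any other
byte container.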

Zach

