Subject: Re: [boost] [unicode] Interest Check / Proof of Concept
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2008-11-20 15:42:41


Eric Niebler wrote:
> Zach Laine wrote:
>>> Over the past few months, I've been tinkering with a Unicode string library.
>>> It's still *far* from finished, but it's far enough along that the overall
>>> structure is visible. I've seen a bunch of Unicode proposals for Boost come
>>> and go, so hopefully this one will address the most common needs people
>>> have.
>>
>> I would love to see a Unicode support library added to Boost.
>> However, I question the usefulness of another string class, or in this
>> case another hierarchy of string classes. Interoperability with
>> std::string (and QString, and CString, and a thousand other
>> API-specific string classes) is always thorny. I'd much rather see an
>> iterators- and algorithms-based approach
> <snip>
>
> Agree. Thanks Zach. I'm discouraged that every time the issue of a
> Unicode library comes up, the discussion immediately descends into a
> debate about how to design yet another string class. Such a high level
> wrapper *might* be useful (strong emphasis on "might"), but the core
> must be the Unicode algorithms, and the design for a Unicode library
> must start there.

I mostly agree. If people want UTF-8 and UTF-16 iterator-adaptors that
will efficiently convert byte-sequence iterators into Unicode character
iterators, then I probably already have exactly that. Should I package
it up for review?
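
To make concrete what such an adaptor has to do at its core, here is a
minimal sketch of the UTF-8 decoding step. It is not the code referred to
above; the names utf8_next and utf8_decode are hypothetical, and the
sketch assumes well-formed input, with only a debug assertion on the
trailing bytes.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: decodes the code point starting at `it` and
// advances `it` past it. Assumes well-formed UTF-8 (no checks for
// overlong forms, surrogates, or truncated input).
template <typename ByteIter>
char32_t utf8_next(ByteIter& it)
{
    unsigned char b0 = static_cast<unsigned char>(*it++);
    if (b0 < 0x80) return b0;                        // 1 byte: 0xxxxxxx

    // Number of trailing bytes implied by the lead byte.
    int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
    char32_t cp = b0 & (0x3F >> extra);              // payload bits of the lead byte
    while (extra-- > 0) {
        unsigned char b = static_cast<unsigned char>(*it++);
        assert((b & 0xC0) == 0x80);                  // trailing bytes are 10xxxxxx
        cp = (cp << 6) | (b & 0x3F);                 // 6 payload bits each
    }
    return cp;
}

// Hypothetical convenience wrapper: decodes a whole byte string.
std::u32string utf8_decode(const std::string& bytes)
{
    std::u32string out;
    for (std::string::const_iterator it = bytes.begin(); it != bytes.end(); )
        out.push_back(utf8_next(it));
    return out;
}
```

A real iterator-adaptor would wrap this step in operator* and operator++
and would also need to decide how to report malformed input.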

There are, however, a few points to consider. Most importantly, if you
operate on a UTF-8 string only through an iterator-adaptor then you'll
miss out on most of the clever features of the encoding. Specifically:

- If you need to search for an ASCII character in a UTF-8 string then
you can do so just by scanning the bytes, because bytes below 0x80
never occur inside a multi-byte sequence.
- Similarly, searching for substrings (including substrings with
non-ASCII characters) can be done just by scanning for a bytewise match.
- Sorting into code-point order can be done using strcmp()-like
comparisons on the byte sequences.
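
The three points above can be sketched with nothing more than
std::string and memcmp. The function names are hypothetical and the
inputs are assumed to be well-formed UTF-8. Note that the resulting
order is code-point order, not a language-aware collation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// ASCII search: a byte below 0x80 never appears inside a multi-byte
// sequence, so a plain byte scan cannot produce a false match.
std::size_t find_ascii(const std::string& utf8, char c)
{
    assert(static_cast<unsigned char>(c) < 0x80);
    return utf8.find(c);
}

// Substring search: lead bytes and trailing bytes use disjoint value
// ranges, so a bytewise match of a well-formed needle always begins
// on a character boundary.
std::size_t find_substring(const std::string& utf8, const std::string& needle)
{
    return utf8.find(needle);
}

// Ordering: comparing the bytes as unsigned values gives the same
// result as comparing the strings code point by code point.
bool less_by_code_point(const std::string& a, const std::string& b)
{
    std::size_t n = std::min(a.size(), b.size());
    int r = std::memcmp(a.data(), b.data(), n);
    return r < 0 || (r == 0 && a.size() < b.size());
}
```

None of these functions decodes a single character, which is the
saving that an adaptor-only interface would give up.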

An implementation that doesn't somehow exploit these optimisations will
perform sub-optimally, and I don't think that would be acceptable.

I don't really have a complete solution to offer. What I do have is
the beginnings of a character-set traits class with booleans indicating
things like "is an ASCII superset", "is variable-length" etc. The idea
is that algorithms could be specialised based on these traits. I'm not
sure how it all joins together yet though.
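
One way the pieces might join together is tag dispatch on the traits.
The sketch below is only an illustration: every name in it is
hypothetical, toy_dbcs_tag is a made-up double-byte encoding rather
than a real character set, and is_ascii_superset is read here as "a
byte below 0x80 always stands for that ASCII character wherever it
appears".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

struct utf8_tag {};
struct toy_dbcs_tag {};  // made up: a byte >= 0x80 is followed by one arbitrary trail byte

template <typename Charset> struct charset_traits;

template <> struct charset_traits<utf8_tag> {
    static const bool is_ascii_superset  = true;
    static const bool is_variable_length = true;
};

template <> struct charset_traits<toy_dbcs_tag> {
    static const bool is_ascii_superset  = false;  // trail bytes may look like ASCII
    static const bool is_variable_length = true;
    static std::size_t char_length(unsigned char lead) { return lead >= 0x80 ? 2 : 1; }
};

namespace detail {
// Fast path: scan the raw bytes.
template <typename Charset>
std::size_t find_ascii(const std::string& s, char c, std::true_type)
{
    return s.find(c);
}

// Slow path: step one character at a time so trail bytes are never tested.
template <typename Charset>
std::size_t find_ascii(const std::string& s, char c, std::false_type)
{
    for (std::size_t i = 0; i < s.size(); ) {
        std::size_t len =
            charset_traits<Charset>::char_length(static_cast<unsigned char>(s[i]));
        if (len == 1 && s[i] == c) return i;
        i += len;
    }
    return std::string::npos;
}
} // namespace detail

// Picks the implementation at compile time from the traits.
template <typename Charset>
std::size_t find_ascii(const std::string& s, char c)
{
    assert(static_cast<unsigned char>(c) < 0x80);
    typedef std::integral_constant<
        bool, charset_traits<Charset>::is_ascii_superset> fast;
    return detail::find_ascii<Charset>(s, c, fast());
}
```

With this shape, find_ascii<utf8_tag> compiles down to the plain byte
scan, while find_ascii<toy_dbcs_tag> skips over trail bytes that merely
look like the character being searched for.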

Cheers, Phil.

