
From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2006-10-01 06:03:44


On 9/16/06, loufoque <mathias.gaunard_at_[hidden]> wrote:
>
> Since no one has old code for reuse, I will start to write a few usable
> tools from scratch.
> Note that I am not a Unicode expert nor a C++ guru.
> I am just willing to work in that area and hope my code could be useful
> to some.

I'm sorry to enter the discussion this late, but I was unable to reply
earlier. Graham Barnett and I started on a Unicode library implementation a
year ago but failed to deliver anything. I can offer you two things. One is
some codecvt facets for UTF-8 and UTF-16, slightly faster and more up to
date than the ones I think are in Boost now. I've recently been thinking
about how this whole Unicode effort should proceed, so I'll also offer
some advice.

> Feel free to comment and give ideas, since I think the design is the
> most important thing first, especially for usage with boost, even though
> this topic has already been discussed a few times.
>
> string/wstring is not really suited to contain Unicode data, because of
> limitations of char_traits, the basic_string interface, and the
> dependence of the string and wstring types on locales.
> I think it is better to consider the string, char[], wstring and
> wchar_t[] types to be in the system locales and to use a separate type
> for unicode strings.
>
> The aim would then be to provide an abstract Unicode string type,
> independent of C++ locales, at the grapheme cluster level, while also
> giving access to lower levels.
> It would only handle unicode in a generic way at the beginning (no
> locales or tailored things).
> This string could maintain the characters in a normalized form (which
> means potential loss of information about singleton characters) in order
> to allow more efficient comparison and searching.

I fully agree with this. It may be a good idea to separate the library into
smaller modules. The grapheme-based string will probably use a string of
code points internally. Given that, you may want to implement a UTF
library first, which should deal only with the code point <-> code unit
conversion. Setting out to design this UTF library first will also
concentrate and streamline the discussion. The Boost community is
English-language centred, and not everyone may be intimately familiar with
the concept of grapheme clusters. When building a real Unicode library on
top of a UTF library, discussion can focus on handling grapheme clusters,
normalisation, and the Unicode database you'll need for that.

(Note that when you say "comparison" and "searching" you're speaking of just
binary comparison; for locale-specific collation you'll probably want to
attach sort keys to strings for efficiency. That's for later, though.)

Just my 2p. I'd be delighted to explain my views in more detail.
Regards,
Rogier


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk