From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2004-10-20 08:08:33


On Tue, 19 Oct 2004 18:32:50 +0200, Erik Wien <wien_at_[hidden]> wrote:
> ----- Original Message -----
> From: "Rogier van Dalen" <rogiervd_at_[hidden]>
>
> > I've recently started on the first draft of a Unicode library.
> >
>
> Interesting. Is there an ongoing discussion about this library that I have
> missed, or haven't you posted anything about it yet? I'd hate to start
> something like this if an effort on the subject is already under way.

It's in the planning stage; I have a preliminary implementation of
some parts. Your message prompted me to make my ideas public.

> > I think a definition of unicode::code as uint32_t would be much
> > better. Problem is, codecvt is only implemented for wchar_t and char,
> > so it's not possible to make a Unicode codecvt without manually adding
> > (dummy) implementations of codecvt<unicode::code,char,mbstate_t> to
> > the std namespace. I guess this is the reason that Ron Garcia just
> > used wchar_t.
> >
> I don't really feel locking the code unit size to 32 bits is a good solution
> either, as strings would then become unnecessarily large.

As I tried to show, the choice of the underlying buffer is templated.
This could be std::string, or an SGI rope<wchar_t>, or anything else.
A char-based buffer would automatically make it a UTF-8-encoded
string, a wchar_t-based one UTF-16 or UTF-32 depending on the size of
wchar_t, and so on. I agree with you (and with the Unicode standard)
that UTF-16 strings are probably best for most practical
applications. The interface should IMHO always use UTF-32 (I agree
with the Unicode standard here too):
codepoint_string<...> s = ....;
I think *s.begin() should return a UTF-32-encoded codepoint.
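
To make this concrete, here is a minimal sketch of the idea, not the actual draft. It assumes a hypothetical utf_decoder traits class selected by the buffer's code-unit type, uses the later-standardised char32_t in place of uint32_t, and does no validation beyond one assert (well-formed input is assumed). The helper to_utf32 is likewise only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical traits: how one code point is laid out in a given code unit.
template <typename Unit> struct utf_decoder;

template <> struct utf_decoder<char> {  // char buffer => UTF-8
    static std::size_t length(char lead) {
        unsigned char c = static_cast<unsigned char>(lead);
        return c < 0x80 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
    }
    template <typename It> static char32_t decode(It it) {
        unsigned char c = static_cast<unsigned char>(*it);
        std::size_t n = length(*it);
        // Keep the payload bits of the lead byte, then 6 bits per trail byte.
        char32_t cp = (n == 1) ? c : (c & (0xFF >> (n + 1)));
        for (std::size_t i = 1; i < n; ++i) {
            ++it;
            cp = (cp << 6) | (static_cast<unsigned char>(*it) & 0x3F);
        }
        return cp;
    }
};

template <> struct utf_decoder<char16_t> {  // 16-bit buffer => UTF-16
    static std::size_t length(char16_t lead) {
        return (lead >= 0xD800 && lead <= 0xDBFF) ? 2 : 1;
    }
    template <typename It> static char32_t decode(It it) {
        char32_t hi = *it;
        if (length(*it) == 1) return hi;
        ++it;
        char32_t lo = *it;
        assert(lo >= 0xDC00 && lo <= 0xDFFF);  // well-formed input assumed
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }
};

// The buffer type is a template parameter; iteration always yields UTF-32.
template <typename Buffer>
class codepoint_string {
    typedef utf_decoder<typename Buffer::value_type> decoder;
public:
    class const_iterator {
    public:
        explicit const_iterator(typename Buffer::const_iterator p) : pos_(p) {}
        char32_t operator*() const { return decoder::decode(pos_); }
        const_iterator& operator++() {
            // Skip all code units that make up the current code point.
            std::advance(pos_, static_cast<std::ptrdiff_t>(decoder::length(*pos_)));
            return *this;
        }
        bool operator==(const const_iterator& o) const { return pos_ == o.pos_; }
        bool operator!=(const const_iterator& o) const { return pos_ != o.pos_; }
    private:
        typename Buffer::const_iterator pos_;
    };

    explicit codepoint_string(const Buffer& b) : buffer_(b) {}
    const_iterator begin() const { return const_iterator(buffer_.begin()); }
    const_iterator end() const { return const_iterator(buffer_.end()); }
private:
    Buffer buffer_;
};

// Illustration only: collect every code point the iterators produce.
template <typename Buffer>
std::u32string to_utf32(const codepoint_string<Buffer>& s) {
    std::u32string out;
    for (typename codepoint_string<Buffer>::const_iterator i = s.begin();
         i != s.end(); ++i)
        out.push_back(*i);
    return out;
}
```

With this, codepoint_string<std::string> and codepoint_string<std::u16string> holding the same text yield the same sequence of code points, so the storage encoding becomes a space/speed decision that the interface never exposes.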

The codecvt class converts to UTF-32 because it didn't occur to me to
do anything else; and why would you?

Regards,
Rogier

