Subject: Re: [boost] [unicode] Interest Check / Proof of Concept
From: James Porter (porterj_at_[hidden])
Date: 2008-11-19 16:22:09


Andrew Sutton wrote:
> I think it looks like a good start. I'm getting a warning about a
> string->wchar_t conversion.

I think gcc is complaining because it defines wchar_t as 32 bits.
Honestly, wchar_t is pretty awful since its size is platform-dependent,
but I don't think any compiler supports the new C++0x Unicode string literals yet. :)
I suppose I could have said "int16_t raw[] = { 'T', 'e', 's', 't', ...
};", but that's not very readable!
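
For reference, here is a minimal sketch of what the readable version could look like once compilers ship the proposed Unicode literals. It assumes a C++11 (or later) compiler; the helper names make_test_string and wchar_width_in_bytes are made up for illustration and are not part of the proof of concept.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: a C++11 (or later) compiler. u"..." literals yield char16_t
// code units, so the element width no longer depends on wchar_t.
inline std::u16string make_test_string() {
    return u"Test";
}

// wchar_t's width is implementation-defined: typically 4 bytes with gcc on
// Linux and 2 bytes with MSVC on Windows, which is the likely source of the
// conversion warning.
inline std::size_t wchar_width_in_bytes() {
    return sizeof(wchar_t);
}
```

The string holds the same four code units on every platform, whereas a wchar_t-based literal changes width with the compiler.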

> Just a couple comments/questions...
> - I don't think the global rt encoding objects are the way to go. I would
> just have each string object declare the encoding object either as a member
> variable or as needed inside a member function. Since they don't have any
> member variables, the cost is negligible.

This is probably workable. Do you envision something like the following?

        my_string.encode(source,utf8());

It would have the benefit of making the interface for ct_strings and
rt_strings the same. For ct_strings, it would specialize on the type of
the encoding parameter, and for rt_strings, it would wrap the encoding
up in some object to give it virtual dispatch.
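
A rough sketch of how that shared interface might look, under some assumptions: the utf8 and ascii tag types, their name() members, and the encoding_holder wrapper are all hypothetical, and the actual transcoding step is left out (the bytes are stored unchanged) so that only the dispatch is visible.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical encoding tags: empty types, so constructing one costs nothing.
struct utf8  { static const char* name() { return "utf-8"; } };
struct ascii { static const char* name() { return "ascii"; } };

// Compile-time string: the encoding is part of the type.
template <typename Encoding>
class ct_string {
public:
    // Takes a tag of the string's own encoding type, so the choice of
    // encoding is resolved at compile time.
    void encode(const std::string& source, Encoding) { data_ = source; }
    const char* encoding_name() const { return Encoding::name(); }
private:
    std::string data_;  // placeholder: real code would transcode here
};

// Run-time string: the tag is wrapped in an object with virtual dispatch.
class rt_string {
    struct encoding_base {
        virtual ~encoding_base() {}
        virtual const char* name() const = 0;
    };
    template <typename Encoding>
    struct encoding_holder : encoding_base {
        const char* name() const override { return Encoding::name(); }
    };
public:
    // Accepts any tag type and erases it behind encoding_base.
    template <typename Encoding>
    void encode(const std::string& source, Encoding) {
        enc_.reset(new encoding_holder<Encoding>());
        data_ = source;  // placeholder: real code would transcode here
    }
    const char* encoding_name() const { return enc_ ? enc_->name() : "none"; }
private:
    std::unique_ptr<encoding_base> enc_;
    std::string data_;
};
```

Both classes are then called the same way, e.g. s.encode(source, utf8()), and only rt_string pays for the virtual call.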

> - Would it be possible to merge the ct/rt classes into a single type?

This would definitely be possible. Assuming I can make the interface
identical, I could just make a special "encoding type" for ct_strings to
make them behave like rt_strings do now.

> - Maybe encode/decode should be free functions - algorithm like.
>
> You might have something like:
>
> estring<> s = ...; // Create an encodable string with some default encoding (ascii?)
> encode(s, utf8()); // utf8 is a functor object that returns a utf8_encoder object.
>
> I guess if you go this way, the estring class would just contain an encoded
> string associated with the encoder type. It might be an interesting
> approach. Still, a good start.

Do you envision the encode algorithm re-encoding the contents of s into
a new encoding, or just tagging s with a "utf8" encoding? Perhaps a
better verb for "encode" would have been "transcode", since it's
responsible for decoding from a source and encoding to a target.
"encode" sounds better though. :)

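To make the "transcode" reading concrete, here is a small sketch of a free function that decodes from a source encoding to code points and then encodes to a target. Everything in it is an assumption for illustration: the latin1 and utf8 codec types, their decode/encode members, and the transcode signature are not taken from the proof of concept, and the UTF-8 encoder only covers code points up to U+FFFF with no validation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical source codec: every Latin-1 byte maps to the code point
// with the same value.
struct latin1 {
    static std::vector<std::uint32_t> decode(const std::string& bytes) {
        std::vector<std::uint32_t> out;
        for (unsigned char c : bytes) out.push_back(c);
        return out;
    }
};

// Hypothetical target codec. Simplification: handles code points up to
// U+FFFF only and does not reject surrogates.
struct utf8 {
    static std::string encode(const std::vector<std::uint32_t>& cps) {
        std::string out;
        for (std::uint32_t cp : cps) {
            if (cp < 0x80) {
                out += static_cast<char>(cp);
            } else if (cp < 0x800) {
                out += static_cast<char>(0xC0 | (cp >> 6));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            } else {
                out += static_cast<char>(0xE0 | (cp >> 12));
                out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                out += static_cast<char>(0x80 | (cp & 0x3F));
            }
        }
        return out;
    }
};

// Decode from the source encoding, then encode to the target encoding.
template <typename Source, typename Target>
std::string transcode(const std::string& bytes, Source, Target) {
    return Target::encode(Source::decode(bytes));
}
```

A call such as transcode(input, latin1(), utf8()) rewrites the bytes, which is different from merely tagging the string with a new encoding and leaving its contents alone.
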
- Jim


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk