From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2007-09-26 16:46:40


Joseph Gauterin wrote:
> IIRC, some of the non-const std::basic_string methods aren't suitable
> for handling variable width encodings like utf8 and utf16 - non-const
> operator[] in particular returns a reference to the character type - a
> big problem if you want to assign a value > 0x7F (i.e. a character
> that uses 2 or more bytes).

Yes, very true. One option is to convert to a fixed-size character set
before doing anything like operator[], and to not allow strings of
variable-width character sets. If you do want to apply operator[] to a
UTF8 string, what type should it return? A reference to a range of
bytes, somehow? A proxy that encodes/decodes to a UCS4 character? Or,
you could say that the iterator is a byte iterator, not a character
iterator. Lots of possibilities.
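
To make the trade-off concrete, here is a minimal sketch of the "decode to
a UCS4 value" option. The names decode_utf8 and char_at are hypothetical
and not taken from any existing library; the code assumes well-formed
UTF-8 and does no validation. Note that character-indexed access has to
walk the string from the start, so it is O(n) rather than O(1), and that
it can only return a value, not an assignable reference.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode the UTF-8 sequence that starts at byte
// offset pos and advance pos past it. Assumes well-formed input.
inline char32_t decode_utf8(const std::string& s, std::size_t& pos)
{
    assert(pos < s.size());
    unsigned char lead = static_cast<unsigned char>(s[pos++]);
    if (lead < 0x80)
        return lead;                          // single byte (ASCII)

    // Number of continuation bytes implied by the lead byte.
    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;

    // Keep only the payload bits of the lead byte (5, 4 or 3 of them).
    char32_t cp = lead & (0x3F >> extra);

    // Each continuation byte contributes its low 6 bits.
    while (extra-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos++]) & 0x3F);
    return cp;
}

// Hypothetical character-indexed access: returns the n-th code point
// by value. Assumes n is less than the number of characters in s.
inline char32_t char_at(const std::string& s, std::size_t n)
{
    std::size_t pos = 0;
    char32_t cp = decode_utf8(s, pos);
    while (n-- > 0)
        cp = decode_utf8(s, pos);
    return cp;
}
```

For the four-byte string "a\xC3\xA9z" (a, e-acute, z), s[1] is the byte
0xC3, whereas char_at(s, 1) is the code point 0xE9. That difference is
exactly the byte-iterator versus character-iterator choice described
above.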

> I've noticed that there are frequent requests/proposals for some sort
> of boost unicode/string encoding library. I've thought about the
> problem and it seems too big for one person to handle in their spare
> time

Let me say "part time" rather than "spare time"...

> - perhaps a group of us should get together to discuss working on
> one? I'd be happy to participate.

I would definitely encourage breaking the work up into smaller chunks.
IMHO "smaller is better" for Boost libraries; there have been a number
of occasions when I've discovered that a feature I want is hidden as an
internal component of a Boost library, and I've felt that it should
have been a stand-alone public entity. So let's think about how this
work can be split up:

- A charset_trait class. I have started on this. The missing piece is
a way to look up traits of character sets that are known at run-time;
input would be appreciated.

- Compile-time and run-time tagged strings. The basics of this are
straightforward and done.

- Conversions. My approach at present is to use iconv via a functor
that I wrote a while ago. I believe iconv is widely available;
however, some implementations may support only a small set of character
sets. Alternatives would be interesting.

- Variable width iterators, including the issue that you raised above.

- Interaction with locales, internationalisation, and system APIs.

and no doubt more. Thinking about the interfaces between these areas
and the user would be a good place to start.
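
As an illustration only, the first two items might fit together roughly
as follows. This is a sketch under assumptions, not the code mentioned
above: the tag types, the members of charset_trait, and the tagged_string
and concat names are all invented here, and the run-time lookup is left
out because it is still an open question.

```cpp
#include <cstddef>
#include <string>

// Hypothetical compile-time charset tags.
struct utf8_tag {};
struct latin1_tag {};
struct ucs4_tag {};

// Hypothetical trait. The primary template is deliberately left
// undefined so that an unknown charset fails to compile.
template <typename Charset> struct charset_trait;

template <> struct charset_trait<utf8_tag> {
    typedef char unit_type;
    static const bool is_variable_width = true;
    static const char* name() { return "UTF-8"; }
};

template <> struct charset_trait<latin1_tag> {
    typedef char unit_type;
    static const bool is_variable_width = false;
    static const char* name() { return "ISO-8859-1"; }
};

template <> struct charset_trait<ucs4_tag> {
    typedef char32_t unit_type;
    static const bool is_variable_width = false;
    static const char* name() { return "UCS-4"; }
};

// Hypothetical compile-time tagged string: the charset is part of the
// type, so strings in different charsets cannot be mixed by accident.
template <typename Charset>
class tagged_string {
public:
    typedef typename charset_trait<Charset>::unit_type unit_type;
    typedef std::basic_string<unit_type> storage_type;

    tagged_string() {}
    explicit tagged_string(const storage_type& units) : units_(units) {}

    const storage_type& units() const { return units_; }
    std::size_t unit_count() const { return units_.size(); }

private:
    storage_type units_;
};

// Concatenation is only defined for two strings with the same tag;
// mixing charsets would require an explicit conversion first.
template <typename Charset>
tagged_string<Charset> concat(const tagged_string<Charset>& a,
                              const tagged_string<Charset>& b)
{
    return tagged_string<Charset>(a.units() + b.units());
}
```

With these definitions, concat of a tagged_string<utf8_tag> and a
tagged_string<latin1_tag> does not compile, which is where a conversion
component would plug in.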

Regards,

Phil.

