From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2007-09-26 16:46:40
Joseph Gauterin wrote:
> IIRC, some of the non-const std::basic_string methods aren't suitable
> for handling variable width encodings like utf8 and utf16 - non-const
> operator[] in particular returns a reference to the character type - a
> big problem if you want to assign a value > 0x7F (i.e. a character
> that uses 2 or more bytes).
Yes, very true. One option is to convert to a fixed-size character set
before doing anything like operator[], and to not allow strings of
variable-width character sets. If you do want to apply operator[] to a
UTF8 string, what type should it return? A reference to a range of
bytes, somehow? A proxy that encodes/decodes to a UCS4 character? Or,
you could say that the iterator is a byte iterator, not a character
iterator. Lots of possibilities.
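To make the trade-off concrete, here is a minimal sketch of the "decode to a UCS4 character" option. The names decoded, decode_utf8_at, encode_utf8 and replace_char_at are hypothetical, invented for illustration only. The sketch assumes well-formed UTF-8 and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper type: a decoded code point plus the number of
// bytes it occupied in the UTF-8 string.
struct decoded {
    char32_t code_point;
    std::size_t length;
};

// Decode the UTF-8 sequence that starts at byte offset pos.
// Assumption: s is well-formed UTF-8 and pos is on a lead byte.
inline decoded decode_utf8_at(const std::string& s, std::size_t pos) {
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t len = 1;
    char32_t cp = lead;
    if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
    else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
    else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }
    // Each continuation byte contributes its low six bits.
    for (std::size_t i = 1; i < len; ++i)
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos + i]) & 0x3F);
    return {cp, len};
}

// Encode one code point as a UTF-8 byte sequence (1 to 4 bytes).
inline std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Replace the character that starts at byte offset pos.
// Unlike assignment through a char&, this may change s.size().
inline void replace_char_at(std::string& s, std::size_t pos, char32_t cp) {
    const decoded old = decode_utf8_at(s, pos);
    s.replace(pos, old.length, encode_utf8(cp));
}
```

Replacing the 'b' in "abc" with U+00E9 grows the string from 3 to 4 bytes, which is exactly why a plain reference to the character type cannot work for writes: any proxy returned by operator[] would have to perform a resizing replace like this one.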
> I've noticed that there are frequent requests/proposals for some sort
> of boost unicode/string encoding library. I've thought about the
> problem and it seems too big for one person to handle in their spare
> time
Let me say "part time" rather than "spare time"...
> - perhaps a group of us should get together to discuss working on
> one? I'd be happy to participate.
I would definitely encourage breaking the work up into smaller chunks.
IMHO "smaller is better" for Boost libraries; there have been a number
of occasions when I've discovered that a feature I want is hidden as an
internal component of a Boost library, and I've felt that it should
have been a stand-alone public entity. So let's think about how this
work can be split up:
- A charset_trait class. I have started on this. The missing piece is
a way to look up traits of character sets that are known at run-time;
input would be appreciated.
- Compile-time and run-time tagged strings. The basics of this are
straightforward and done.
- Conversions. My approach at present is to use iconv via a functor
that I wrote a while ago. I believe iconv is widely available;
however, some implementations may support only a small set of character
sets. Alternatives would be interesting.
- Variable width iterators, including the issue that you raised above.
- Interaction with locales, internationalisation, and system APIs.
and no doubt more. Thinking about the interfaces between these areas
and the user would be a good place to start.
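As a starting point for discussing those interfaces, the first two items might look something like the sketch below. This is not the code mentioned above; every name here (utf8_tag, latin1_tag, ucs4_tag, charset_traits, tagged_string, runtime_charset_info, find_charset) is hypothetical, and the run-time lookup is just a linear search over a tiny hard-coded table.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical compile-time tags, one per character set.
struct utf8_tag {};
struct latin1_tag {};
struct ucs4_tag {};

// Traits looked up at compile time from a tag.
template <typename Tag> struct charset_traits;

template <> struct charset_traits<utf8_tag> {
    typedef char unit_type;
    static constexpr bool fixed_width = false;
    static const char* name() { return "UTF-8"; }
};
template <> struct charset_traits<latin1_tag> {
    typedef char unit_type;
    static constexpr bool fixed_width = true;
    static const char* name() { return "ISO-8859-1"; }
};
template <> struct charset_traits<ucs4_tag> {
    typedef char32_t unit_type;
    static constexpr bool fixed_width = true;
    static const char* name() { return "UCS-4"; }
};

// A string whose character set is part of its type.
template <typename Tag>
class tagged_string {
public:
    typedef charset_traits<Tag> traits;
    typedef std::basic_string<typename traits::unit_type> storage_type;

    explicit tagged_string(const storage_type& units) : units_(units) {}
    const storage_type& units() const { return units_; }

    // Character indexing compiles only for fixed-width character sets.
    typename traits::unit_type char_at(std::size_t i) const {
        static_assert(traits::fixed_width,
                      "char_at requires a fixed-width character set");
        return units_[i];
    }

private:
    storage_type units_;
};

// One possible shape for the run-time side: the same facts in a table.
struct runtime_charset_info {
    const char* name;
    bool fixed_width;
    std::size_t unit_size;
};

// Returns a pointer to the matching entry, or nullptr if unknown.
inline const runtime_charset_info* find_charset(const std::string& name) {
    static const runtime_charset_info table[] = {
        {"UTF-8", false, sizeof(char)},
        {"ISO-8859-1", true, sizeof(char)},
        {"UCS-4", true, sizeof(char32_t)},
    };
    for (const runtime_charset_info& entry : table)
        if (name == entry.name) return &entry;
    return nullptr;
}
```

With this arrangement tagged_string<latin1_tag>("abc").char_at(1) compiles and returns 'b', while the same call on a tagged_string<utf8_tag> is rejected at compile time. Whether the run-time table should be generated from the traits, or the other way round, is one of the interface questions still open.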
Regards,
Phil.