From: Joseph Gauterin (joseph.gauterin_at_[hidden])
Date: 2007-09-26 17:21:21


> Yes, very true. One option is to convert to a fixed-size character set
> before doing anything like operator[], and to not allow strings of
> variable-width character sets. If you do want to apply operator[] to a
> UTF8 string, what type should it return? A reference to a range of
> bytes, somehow? A proxy that encodes/decodes to a UCS4 character? Or,
> you could say that the iterator is a byte iterator, not a character
> iterator. Lots of possibilities.
I'd add making the string classes immutable to the list. That way,
dereferencing an iterator (by which I mean calling unary operator*) of any
type could return a Unicode code point by value. Mutable
sequences that pretend to hold a different type than they actually do
don't work well with C++ idioms (e.g. vector<bool>). Strings could be
built using a stringstream-like approach or by concatenation
(with possible expression template optimizations).
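
A minimal sketch of the immutable idea, to make the "code point by value"
point concrete. Everything here is an assumption for illustration: the name
utf8_string, the choice of char32_t as the code point type, and the
requirement that the input is already valid UTF-8 (no validation is done).

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <utility>

// Hypothetical immutable UTF-8 string: it exposes no mutating members, so
// its iterator can return code points by value instead of proxy references.
class utf8_string {
public:
    class const_iterator {
    public:
        // Only an input iterator: operator* returns a value, not a reference.
        using iterator_category = std::input_iterator_tag;
        using value_type = char32_t;
        using difference_type = std::ptrdiff_t;
        using pointer = const char32_t*;
        using reference = char32_t;

        explicit const_iterator(const char* p) : p_(p) {}

        // Decode the sequence starting at p_ (assumes well-formed UTF-8).
        char32_t operator*() const {
            const unsigned char b0 = static_cast<unsigned char>(p_[0]);
            if (b0 < 0x80) return b0;
            const std::size_t n = length(b0);
            char32_t cp = b0 & (0xFF >> (n + 1));  // payload bits of the lead byte
            for (std::size_t i = 1; i < n; ++i)
                cp = (cp << 6) | (static_cast<unsigned char>(p_[i]) & 0x3F);
            return cp;
        }
        // Advance by one whole code point, not by one byte.
        const_iterator& operator++() {
            p_ += length(static_cast<unsigned char>(*p_));
            return *this;
        }
        const_iterator operator++(int) { const_iterator t = *this; ++*this; return t; }
        bool operator==(const const_iterator& o) const { return p_ == o.p_; }
        bool operator!=(const const_iterator& o) const { return p_ != o.p_; }

    private:
        // Sequence length implied by the lead byte.
        static std::size_t length(unsigned char lead) {
            if (lead < 0x80) return 1;
            if ((lead >> 5) == 0x6) return 2;
            if ((lead >> 4) == 0xE) return 3;
            return 4;
        }
        const char* p_;
    };

    explicit utf8_string(std::string bytes) : bytes_(std::move(bytes)) {}
    const_iterator begin() const { return const_iterator(bytes_.data()); }
    const_iterator end() const { return const_iterator(bytes_.data() + bytes_.size()); }

private:
    std::string bytes_;  // never modified after construction
};
```

With this, a loop such as `for (char32_t cp : s)` visits code points rather
than bytes. Because nothing can be written through the iterator, there is no
need for a vector<bool>-style proxy.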

Making the iterator a byte iterator, not a code point iterator, pushes
the responsibility for handling the variable width of the different
encodings back onto the user.
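
To illustrate what that responsibility looks like, here is a rough sketch of
counting code points when all you have is byte iteration over a std::string
holding UTF-8. The function name count_code_points is made up for this
example, and the input is assumed to be well-formed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// With byte iteration the caller has to know UTF-8 framing: continuation
// bytes look like 10xxxxxx and do not start a new code point.
inline std::size_t count_code_points(const std::string& utf8_bytes) {
    std::size_t n = 0;
    for (unsigned char byte : utf8_bytes) {
        if ((byte & 0xC0) != 0x80) ++n;  // count only lead bytes
    }
    return n;
}
```

Every algorithm that cares about characters would need this kind of
encoding-specific logic, and it would be different again for UTF-16.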

There are certainly a lot of possibilities, and we should try to get
some sort of consensus before we go further with this.

> I would definitely encourage breaking the work up into smaller chunks.
Agreed.

> Conversions. My approach at present is to use iconv via a functor
> that I wrote a while ago. I believe iconv is widely available;
> however, some implementations may support only a small set of character
> sets. Alternatives would be interesting.
IIRC, iconv is licensed under the GPL, which would prevent it from
being integrated into Boost. We should make whatever interface we come
up with easily extensible, so that people can add support for
whatever encoding they require, possibly using iconv if using GPL
software isn't a problem for them.
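
One possible shape for such an extension point, as a sketch only. The names
decoder, latin1_decoder and decoder_registry are hypothetical and do not
come from Boost or iconv; the choice of std::u32string as the common target
is also an assumption.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical extension point: one decoder per encoding, each producing
// UCS-4 code points from raw bytes.
struct decoder {
    virtual ~decoder() = default;
    virtual std::u32string decode(const std::string& bytes) const = 0;
};

// Example decoder: ISO-8859-1 maps each byte to the code point of the
// same numeric value.
struct latin1_decoder : decoder {
    std::u32string decode(const std::string& bytes) const override {
        std::u32string out;
        out.reserve(bytes.size());
        for (unsigned char b : bytes) out.push_back(static_cast<char32_t>(b));
        return out;
    }
};

// Registry that user code can extend with its own decoders.
class decoder_registry {
public:
    void add(std::string name, std::unique_ptr<decoder> d) {
        decoders_[std::move(name)] = std::move(d);
    }
    const decoder& get(const std::string& name) const {
        auto it = decoders_.find(name);
        if (it == decoders_.end())
            throw std::out_of_range("unknown encoding: " + name);
        return *it->second;
    }

private:
    std::map<std::string, std::unique_ptr<decoder>> decoders_;
};
```

A user who is happy with iconv's license could register a decoder that
wraps it, without the library itself depending on iconv.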

> - Interaction with locales, internationalisation, and system APIs.
We'll definitely need a way to convert to a raw pointer representation
(like std::string::c_str()) for interaction with some APIs.
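
Roughly what I have in mind, assuming a byte-oriented encoding such as UTF-8
stored in a std::string. The class name encoded_string and its members are
placeholders, not a proposed interface; a UTF-16 or UCS-4 string would need
a pointer to a wider unit type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical wrapper that owns encoded bytes and exposes them to C APIs.
class encoded_string {
public:
    explicit encoded_string(std::string bytes) : bytes_(std::move(bytes)) {}

    // Null-terminated view of the stored bytes; valid while *this is alive.
    const char* c_str() const { return bytes_.c_str(); }

    // Size in bytes, which is not the number of code points.
    std::size_t size_in_bytes() const { return bytes_.size(); }

private:
    std::string bytes_;
};
```

The pointer can then be handed to anything that expects a C string, for
example std::strlen or std::fopen.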

Lots to think about.

