From: Daryle Walker (darylew_at_[hidden])
Date: 2004-10-22 18:26:34


I think that there should be at most two _external_ Unicode string types:

1. vector of Unicode code-points
    externally each element is like an int32_t
2. vector of abstract Unicode characters
    externally each element is a group of Unicode code-points
    (a base, or starting, code point followed by combining code points)

Option [2] is what we ultimately want, so [1] should be included only if
really needed. Internally, such strings could use UTF-8, UTF-16, etc., but
most users don't care about that from an outside perspective.
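
To make the two external views concrete, here is a minimal sketch in C++.
The names code_point, abstract_character, is_combiner and group_characters
are hypothetical, and the combiner test is a deliberate simplification that
only knows the Combining Diacritical Marks block (U+0300..U+036F); a real
implementation would follow the Unicode text segmentation rules.

```cpp
#include <vector>

// View [1]: one element per Unicode code point.
using code_point = char32_t;

// View [2]: one element per abstract character, i.e. a base code point
// plus any combining code points that follow it.
struct abstract_character {
    code_point base;
    std::vector<code_point> combiners;
};

// Assumption: only U+0300..U+036F counts as combining in this sketch.
inline bool is_combiner(code_point cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Build view [2] from view [1]. A combiner with nothing before it is
// kept as its own element so that no input is silently dropped.
inline std::vector<abstract_character>
group_characters(const std::vector<code_point>& cps) {
    std::vector<abstract_character> out;
    for (code_point cp : cps) {
        if (is_combiner(cp) && !out.empty())
            out.back().combiners.push_back(cp);
        else
            out.push_back(abstract_character{cp, {}});
    }
    return out;
}
```

With this, the sequence 'e', U+0301, 'x' is three elements in view [1] but
two elements in view [2].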

Some users do care about the external encoding. (I think someone here
wanted UTF-8 XML.) In those cases we would have a specific input or output
routine that takes an appropriate encoding object, hiding whether or not the
Unicode string internally uses the same encoding as the final source/sink.
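
One possible shape for such an output routine, as a sketch only: the caller
picks an encoding object and the routine never reveals how the string is
stored. The names utf8_encoder, utf16be_encoder and write_encoded are
hypothetical, a std::u32string stands in for the Unicode string, and the
encoders assume the code points were already validated on input.

```cpp
#include <string>

// Appends the UTF-8 form of one (already validated) code point.
struct utf8_encoder {
    void operator()(char32_t cp, std::string& out) const {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
};

// Appends the big-endian UTF-16 form, using a surrogate pair above U+FFFF.
struct utf16be_encoder {
    void operator()(char32_t cp, std::string& out) const {
        auto put = [&out](unsigned unit) {
            out += static_cast<char>((unit >> 8) & 0xFF);
            out += static_cast<char>(unit & 0xFF);
        };
        if (cp < 0x10000) {
            put(cp);
        } else {
            cp -= 0x10000;
            put(0xD800 | (cp >> 10));
            put(0xDC00 | (cp & 0x3FF));
        }
    }
};

// The output routine: the sink encoding is chosen by the encoder argument,
// independent of the internal representation of `text`.
template <class Encoder>
std::string write_encoded(const std::u32string& text, const Encoder& encode) {
    std::string out;
    for (char32_t cp : text) encode(cp, out);
    return out;
}
```

An input routine could mirror this with a decoding object in place of the
encoder.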

The internals of the Unicode string could use the class-cluster concept,
as in Mac OS X's Cocoa. Here we would make concrete classes for UTF-8,
UTF-16, UTF-32, etc. strings. (We could include normalization and other
factors in combinations too.) The external class would keep a union (or
something) that holds one of the concrete classes.
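
A rough sketch of that layout, using C++17's std::variant as the "union (or
something)"; that choice is an assumption, and Boost.Variant or a hand-made
tagged union would serve the same purpose. The names utf8_storage,
utf32_storage and unicode_string are hypothetical, and only one operation is
shown.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <variant>

// Concrete class: UTF-8 code units, assumed valid.
struct utf8_storage {
    std::string bytes;
    std::size_t code_point_count() const {
        std::size_t n = 0;
        for (unsigned char b : bytes)
            if ((b & 0xC0) != 0x80) ++n;  // count every non-continuation byte
        return n;
    }
};

// Concrete class: one code unit per code point.
struct utf32_storage {
    std::u32string units;
    std::size_t code_point_count() const { return units.size(); }
};

// External class: callers see one type and never learn which concrete
// representation is in use.
class unicode_string {
public:
    explicit unicode_string(utf8_storage s) : rep_(std::move(s)) {}
    explicit unicode_string(utf32_storage s) : rep_(std::move(s)) {}

    std::size_t code_point_count() const {
        return std::visit(
            [](const auto& s) { return s.code_point_count(); }, rep_);
    }

private:
    std::variant<utf8_storage, utf32_storage> rep_;
};
```

A UTF-16 class, or variants that also record a normalization form, would be
added as further alternatives of the variant.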

Iterators probably should be made for code points and/or abstract
characters. Bidirectional travel would be easiest. Such iterators should
be configurable (at compile- and/or run-time) for various normalization
schemes.
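
As an illustration of the code-point case, a bidirectional iterator over a
UTF-8 buffer might look roughly like this. The class name utf8_iterator is
hypothetical, the buffer is assumed to hold valid UTF-8 already, and
normalization is left out entirely.

```cpp
#include <cstddef>
#include <iterator>

class utf8_iterator {
public:
    using iterator_category = std::bidirectional_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;  // decoded on the fly, so returned by value

    explicit utf8_iterator(const char* p) : p_(p) {}

    // Decode the code point that starts at the current lead byte.
    char32_t operator*() const {
        const unsigned char lead = byte(0);
        const int extra = extra_bytes(lead);
        if (extra == 0) return lead;
        char32_t cp = lead & (0x3F >> extra);  // payload bits of the lead byte
        for (int i = 1; i <= extra; ++i)
            cp = (cp << 6) | (byte(i) & 0x3F);
        return cp;
    }

    // Forward: skip the whole sequence, using the length in the lead byte.
    utf8_iterator& operator++() {
        p_ += 1 + extra_bytes(byte(0));
        return *this;
    }

    // Backward: step over continuation bytes until a lead byte is reached.
    utf8_iterator& operator--() {
        do { --p_; } while ((byte(0) & 0xC0) == 0x80);
        return *this;
    }

    utf8_iterator operator++(int) { utf8_iterator old = *this; ++*this; return old; }
    utf8_iterator operator--(int) { utf8_iterator old = *this; --*this; return old; }

    friend bool operator==(const utf8_iterator& a, const utf8_iterator& b) {
        return a.p_ == b.p_;
    }
    friend bool operator!=(const utf8_iterator& a, const utf8_iterator& b) {
        return a.p_ != b.p_;
    }

private:
    unsigned char byte(int i) const { return static_cast<unsigned char>(p_[i]); }
    static int extra_bytes(unsigned char lead) {
        return lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : lead >= 0xC0 ? 1 : 0;
    }
    const char* p_;
};
```

An abstract-character iterator could be layered on top of this one by
collecting combining code points after each base.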

Input needs special handling, since we shouldn't allow ultimately invalid
byte/code-point combinations into Unicode strings. We need something that
can enumerate over a byte stream for a particular encoding and spit out
whole code-points (or queue the code-points and spit out abstract
characters).
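
A minimal sketch of such a gatekeeper for UTF-8: bytes are fed in one at a
time and only whole, valid code points come out. The names utf8_decoder and
decode_all are hypothetical, and the error policy (report and reset) is one
choice among several.

```cpp
#include <string>

class utf8_decoder {
public:
    enum class status { need_more, code_point, invalid };

    // Feed one byte. On status::code_point, `out` holds the decoded value.
    status feed(unsigned char byte, char32_t& out) {
        if (remaining_ == 0) {
            if (byte < 0x80) { out = byte; return status::code_point; }
            if (byte >= 0xC2 && byte <= 0xDF) {
                cp_ = byte & 0x1F; remaining_ = 1; min_ = 0x80;
            } else if ((byte & 0xF0) == 0xE0) {
                cp_ = byte & 0x0F; remaining_ = 2; min_ = 0x800;
            } else if (byte >= 0xF0 && byte <= 0xF4) {
                cp_ = byte & 0x07; remaining_ = 3; min_ = 0x10000;
            } else {
                // Stray continuation byte, or a lead byte that can never
                // start a valid sequence (0xC0, 0xC1, 0xF5..0xFF).
                return status::invalid;
            }
            return status::need_more;
        }
        if ((byte & 0xC0) != 0x80) {
            remaining_ = 0;  // reset; the offending byte is not re-examined
            return status::invalid;
        }
        cp_ = (cp_ << 6) | (byte & 0x3F);
        if (--remaining_ != 0) return status::need_more;
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if (cp_ < min_ || cp_ > 0x10FFFF || (cp_ >= 0xD800 && cp_ <= 0xDFFF))
            return status::invalid;
        out = cp_;
        return status::code_point;
    }

    // True while a multi-byte sequence is still incomplete.
    bool mid_sequence() const { return remaining_ != 0; }

private:
    char32_t cp_ = 0;
    char32_t min_ = 0;
    int remaining_ = 0;
};

// Decode a whole buffer; returns false on invalid or truncated input, in
// which case `out` holds only the code points decoded before the error.
inline bool decode_all(const std::string& bytes, std::u32string& out) {
    utf8_decoder dec;
    out.clear();
    for (unsigned char b : bytes) {
        char32_t cp = 0;
        const utf8_decoder::status st = dec.feed(b, cp);
        if (st == utf8_decoder::status::invalid) return false;
        if (st == utf8_decoder::status::code_point) out += cp;
    }
    return !dec.mid_sequence();
}
```

Queuing the emitted code points and grouping each base with its combining
code points would give the abstract-character variant.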

-- 
Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com
