From: Daryle Walker (darylew_at_[hidden])
Date: 2004-10-22 18:26:34


I think that there should be at most two _external_ Unicode string types:

1. vector of Unicode code-points
    externally each element is like an int32_t
2. vector of abstract Unicode characters
    externally each element is a group of Unicode code-points
    (a primary or starting code point, followed by combining code points)

Option [2] is what we ultimately want, so [1] should be included only if
really needed. Internally, such strings could use UTF-8, UTF-16, etc., but
most users don't care about that from an outside perspective.
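
To make the two external types concrete, here is a rough sketch. All of the
names (code_point, abstract_char, group, ...) are made up for illustration,
and is_combining is a deliberate simplification that only knows one block of
combining marks; a real version would consult the Unicode character data.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Type [1]: one element per code point ("like an int32_t").
using code_point = std::int32_t;
using code_point_string = std::vector<code_point>;

// Type [2]: one element per abstract character (hypothetical layout).
struct abstract_char {
    code_point base;                    // primary (starting) code point
    std::vector<code_point> combiners;  // combining code points that follow it
};
using abstract_string = std::vector<abstract_char>;

// Simplifying assumption: only U+0300..U+036F (Combining Diacritical Marks)
// counts as combining.
inline bool is_combining(code_point cp) { return cp >= 0x0300 && cp <= 0x036F; }

// Groups a code-point string into abstract characters. A combining code
// point with nothing before it is kept as its own element.
inline abstract_string group(const code_point_string& cps) {
    abstract_string out;
    for (code_point cp : cps) {
        if (is_combining(cp) && !out.empty())
            out.back().combiners.push_back(cp);
        else
            out.push_back(abstract_char{cp, {}});
    }
    return out;
}

// Usage: "e", U+0301 (combining acute), "x" is three code points
// but two abstract characters.
inline void group_example() {
    abstract_string s = group({0x65, 0x0301, 0x78});
    assert(s.size() == 2);
    assert(s[0].base == 0x65 && s[0].combiners.size() == 1);
}
```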

Some users do care about the external representation. (I think someone here
wanted UTF-8 XML.) For those cases we would provide specific input and
output routines that take an appropriate encoding object, which hides
whether or not the Unicode string internally uses the same encoding as the
final source/sink.
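
A minimal sketch of what such an output routine might look like. The names
utf8_encoding, utf16be_encoding, and write_encoded are hypothetical, and a
plain vector of code points stands in for the string's internal form. Both
encoders assume they are handed valid scalar values.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for the string's internal form, whatever it really is.
using code_points = std::vector<std::uint32_t>;

// Hypothetical encoding objects: each appends one code point to a byte sink.
// Both assume cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
struct utf8_encoding {
    void put(std::uint32_t cp, std::string& out) const {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
};

struct utf16be_encoding {
    void put(std::uint32_t cp, std::string& out) const {
        if (cp >= 0x10000) {  // outside the BMP: needs a surrogate pair
            cp -= 0x10000;
            unit(0xD800 + (cp >> 10), out);
            unit(0xDC00 + (cp & 0x3FF), out);
        } else {
            unit(cp, out);
        }
    }
private:
    static void unit(std::uint32_t u, std::string& out) {
        out += static_cast<char>(u >> 8);  // big-endian: high byte first
        out += static_cast<char>(u & 0xFF);
    }
};

// The output routine: the caller picks the encoding object and never sees
// how the string is stored internally.
template <class Encoding>
std::string write_encoded(const code_points& s, const Encoding& enc) {
    std::string bytes;
    for (std::uint32_t cp : s) enc.put(cp, bytes);
    return bytes;
}

inline void write_example() {
    code_points s = {0x41, 0xE9, 0x20AC};  // "A", e-acute, euro sign
    assert(write_encoded(s, utf8_encoding()) == "A\xC3\xA9\xE2\x82\xAC");
    assert(write_encoded(s, utf16be_encoding()).size() == 6);
}
```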

The internals of the Unicode string could use the class cluster concept,
like in Mac OS X's Cocoa. Here we would make concrete classes for UTF-8,
UTF-16, UTF-32, etc. strings. (Normalization form and other factors could be
folded into the combinations too.) The external class would keep a union (or
something) that uses one of the concrete classes.
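
Roughly what I have in mind, as a sketch only. For the "or something" I use
a pointer to an abstract base class instead of a union, since that is the
closest match to how Cocoa does it. Every name here (storage, utf8_storage,
utf32_storage, unicode_string) is hypothetical, and utf8_storage assumes its
bytes are already valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Private interface that every concrete representation implements.
class storage {
public:
    virtual ~storage() {}
    virtual std::size_t length() const = 0;  // length in code points
};

class utf32_storage : public storage {
    std::vector<std::uint32_t> units_;
public:
    explicit utf32_storage(std::vector<std::uint32_t> u) : units_(std::move(u)) {}
    std::size_t length() const override { return units_.size(); }
};

class utf8_storage : public storage {
    std::string bytes_;  // assumed to be valid UTF-8
public:
    explicit utf8_storage(std::string b) : bytes_(std::move(b)) {}
    std::size_t length() const override {
        std::size_t n = 0;
        for (unsigned char c : bytes_)
            if ((c & 0xC0) != 0x80) ++n;  // count everything but continuation bytes
        return n;
    }
};

// The one external class; users never name the concrete classes.
class unicode_string {
    std::unique_ptr<storage> impl_;
    explicit unicode_string(std::unique_ptr<storage> p) : impl_(std::move(p)) {}
public:
    static unicode_string from_utf8(std::string bytes) {
        return unicode_string(
            std::unique_ptr<storage>(new utf8_storage(std::move(bytes))));
    }
    static unicode_string from_utf32(std::vector<std::uint32_t> units) {
        return unicode_string(
            std::unique_ptr<storage>(new utf32_storage(std::move(units))));
    }
    std::size_t length() const { return impl_->length(); }
};

inline void cluster_example() {
    unicode_string a = unicode_string::from_utf8("h\xC3\xA9");
    unicode_string b = unicode_string::from_utf32({0x68, 0xE9});
    assert(a.length() == 2 && b.length() == 2);  // same answer either way
}
```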

Iterators probably should be made for code points and/or abstract
characters. Bidirectional travel would be easiest. Such iterators should
be configurable (at compile- and/or run-time) for various normalization
schemes.
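
As an illustration of the code-point case, here is a bare-bones bidirectional
iterator over UTF-8 storage. The names (utf8_cp_iterator, cp_begin, cp_end)
are made up, the bytes are assumed to be valid UTF-8 already, and the
normalization configuration is left out entirely.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Walks valid UTF-8 one code point at a time, in either direction.
class utf8_cp_iterator {
    const unsigned char* p_;
public:
    explicit utf8_cp_iterator(const char* p)
        : p_(reinterpret_cast<const unsigned char*>(p)) {}

    // Decodes the code point starting at the current lead byte.
    std::uint32_t operator*() const {
        const unsigned char b = p_[0];
        if (b < 0x80) return b;
        const int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
        std::uint32_t cp = b & (0x3F >> extra);  // payload bits of the lead byte
        for (int i = 1; i <= extra; ++i) cp = (cp << 6) | (p_[i] & 0x3F);
        return cp;
    }
    // Forward: skip continuation bytes until the next lead byte (or the end).
    utf8_cp_iterator& operator++() {
        do { ++p_; } while ((*p_ & 0xC0) == 0x80);
        return *this;
    }
    // Backward: same idea; the caller must not step back past the beginning.
    utf8_cp_iterator& operator--() {
        do { --p_; } while ((*p_ & 0xC0) == 0x80);
        return *this;
    }
    bool operator==(const utf8_cp_iterator& o) const { return p_ == o.p_; }
    bool operator!=(const utf8_cp_iterator& o) const { return p_ != o.p_; }
};

// Relies on std::string::data() being null-terminated (C++11 and later),
// so operator++ may safely look at the byte one past the last character.
inline utf8_cp_iterator cp_begin(const std::string& s) {
    return utf8_cp_iterator(s.data());
}
inline utf8_cp_iterator cp_end(const std::string& s) {
    return utf8_cp_iterator(s.data() + s.size());
}

inline void iterator_example() {
    const std::string s = "a\xC3\xA9\xE2\x82\xAC";  // "a", e-acute, euro sign
    utf8_cp_iterator it = cp_begin(s);
    assert(*it == 0x61);
    ++it;
    assert(*it == 0xE9);
    --it;
    assert(*it == 0x61);
}
```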

Input needs special handling, since we shouldn't allow ultimately invalid
byte/code-point combinations into Unicode strings. We need something that
can enumerate over a byte stream for a particular encoding and spit out
whole code-points (or queue the code-points and spit out abstract
characters).
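
For UTF-8, the code-point half of that could look something like the sketch
below. The function name decode_utf8 is hypothetical. It rejects truncated
sequences, stray continuation bytes, overlong forms, surrogates, and values
above U+10FFFF, and it only touches the output once the whole stream has
checked out, so nothing invalid gets into the string. Note that it still
consumes bytes from the stream when it fails.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads the whole stream as UTF-8. On success, appends the code points to
// `out` and returns true. On any invalid input, returns false and leaves
// `out` unchanged.
inline bool decode_utf8(std::istream& in, std::vector<std::uint32_t>& out) {
    const int eof = std::char_traits<char>::eof();
    std::vector<std::uint32_t> decoded;
    for (int c = in.get(); c != eof; c = in.get()) {
        std::uint32_t cp, min;  // min = smallest value this length may encode
        int extra;              // number of continuation bytes expected
        if (c < 0x80) {
            extra = 0; cp = static_cast<std::uint32_t>(c); min = 0;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = static_cast<std::uint32_t>(c & 0x1F); min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = static_cast<std::uint32_t>(c & 0x0F); min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = static_cast<std::uint32_t>(c & 0x07); min = 0x10000;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
        for (int i = 0; i < extra; ++i) {
            const int t = in.get();
            if (t == eof || (t & 0xC0) != 0x80) return false;  // truncated/malformed
            cp = (cp << 6) | static_cast<std::uint32_t>(t & 0x3F);
        }
        if (cp < min) return false;                        // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;    // surrogate
        if (cp > 0x10FFFF) return false;                   // beyond Unicode
        decoded.push_back(cp);
    }
    out.insert(out.end(), decoded.begin(), decoded.end());
    return true;
}

inline void decode_example() {
    std::vector<std::uint32_t> cps;
    std::istringstream good("h\xC3\xA9");
    assert(decode_utf8(good, cps) && cps.size() == 2);
    std::istringstream overlong("\xC0\xAF");  // overlong encoding of "/"
    assert(!decode_utf8(overlong, cps) && cps.size() == 2);
}
```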

-- 
Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com
