From: Daryle Walker (darylew_at_[hidden])
Date: 2004-10-22 18:26:34
I think that there should be at most two _external_ Unicode string types:
1. vector of Unicode code-points
externally, each element is like an int32_t
2. vector of abstract Unicode characters
externally each element is a group of Unicode code-points
(a primary or starting code point, followed by combining code points)
Option [2] is what we ultimately want, so [1] should be included only if
it's really needed. Internally, such strings could use UTF-8, UTF-16, etc.;
most users won't care which, since they only ever see the external view.
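Just to make the idea concrete, the two external types might look something
like this (names invented; this isn't a proposed interface):

    #include <cstdint>
    #include <vector>

    // [1] A code-point string: each element is one Unicode code point.
    using code_point        = std::uint32_t;
    using code_point_string = std::vector<code_point>;

    // [2] An abstract-character string: each element groups a starting
    // code point with the combining code points that follow it.
    struct abstract_character {
        code_point              base;       // primary/starting code point
        std::vector<code_point> combiners;  // trailing combining marks
    };
    using character_string = std::vector<abstract_character>;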
Some users do care about the outside appearance. (I think some guy here
wanted UTF-8 XML.) In those cases we'd have a specific input or output
routine that takes an appropriate encoding object, hiding whether or not the
Unicode string internally uses the same encoding as the final source/sink.
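For example, the output side could take an encoding object and never look at
the string's internal form; a rough sketch for the UTF-8 case (all names
invented):

    #include <cstdint>
    #include <ostream>
    #include <vector>

    using code_point        = std::uint32_t;
    using code_point_string = std::vector<code_point>;  // type [1] above

    // Abstract encoding object: knows how to write code points to a sink.
    class encoding {
    public:
        virtual ~encoding() {}
        virtual void write(std::ostream& sink,
                           const code_point_string& s) const = 0;
    };

    // Concrete encoding: serializes each code point as 1-4 UTF-8 bytes.
    class utf8_encoding : public encoding {
    public:
        void write(std::ostream& sink,
                   const code_point_string& s) const override {
            for (code_point c : s) {
                if (c < 0x80) {
                    sink.put(char(c));
                } else if (c < 0x800) {
                    sink.put(char(0xC0 | (c >> 6)));
                    sink.put(char(0x80 | (c & 0x3F)));
                } else if (c < 0x10000) {
                    sink.put(char(0xE0 | (c >> 12)));
                    sink.put(char(0x80 | ((c >> 6) & 0x3F)));
                    sink.put(char(0x80 | (c & 0x3F)));
                } else {
                    sink.put(char(0xF0 | (c >> 18)));
                    sink.put(char(0x80 | ((c >> 12) & 0x3F)));
                    sink.put(char(0x80 | ((c >> 6) & 0x3F)));
                    sink.put(char(0x80 | (c & 0x3F)));
                }
            }
        }
    };

    // The output routine itself never needs to know the internal encoding.
    void save(std::ostream& sink, const code_point_string& s,
              const encoding& enc)
    {
        enc.write(sink, s);  // e.g. pass utf8_encoding for UTF-8 XML
    }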
The internals of the Unicode string could use the cluster concept, like
Cocoa's class clusters on Mac OS X. Here we would make concrete classes for
UTF-8, UTF-16, UTF-32, etc. strings. (We could include normalization and
other factors in the combinations too.) The external class would keep a
union (or something) that uses one of the concrete classes.
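A sketch of what that cluster might look like, with a hidden abstract
representation class and factory functions on the public class (again, all
names invented):

    #include <cstddef>
    #include <memory>
    #include <string>

    class unicode_string {
        // Hidden abstract representation -- the "cluster" part.
        struct impl {
            virtual ~impl() {}
            virtual std::size_t code_point_count() const = 0;
        };

        // Concrete representations, one per internal encoding.
        struct utf8_impl : impl {
            std::string bytes;
            explicit utf8_impl(std::string b) : bytes(b) {}
            std::size_t code_point_count() const override {
                std::size_t n = 0;
                for (unsigned char c : bytes)
                    if ((c & 0xC0) != 0x80) ++n;  // skip continuation bytes
                return n;
            }
        };

        struct utf32_impl : impl {
            std::u32string points;
            explicit utf32_impl(std::u32string p) : points(p) {}
            std::size_t code_point_count() const override {
                return points.size();
            }
        };

        std::shared_ptr<const impl> rep_;

        explicit unicode_string(std::shared_ptr<const impl> r) : rep_(r) {}

    public:
        // Factories pick the concrete class; callers never see which one.
        static unicode_string from_utf8(std::string bytes) {
            return unicode_string(std::make_shared<utf8_impl>(bytes));
        }
        static unicode_string from_utf32(std::u32string points) {
            return unicode_string(std::make_shared<utf32_impl>(points));
        }

        std::size_t size() const { return rep_->code_point_count(); }
    };

A UTF-16 representation, or normalized variants of each, would just be more
concrete classes behind the same public face.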
Iterators should probably be provided for code points and/or abstract
characters. Bidirectional traversal would be easiest. Such iterators should
be configurable (at compile time and/or run time) for various normalization
schemes.
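Assuming already-validated UTF-8 storage (validation is the input side's
job, below), a bidirectional code-point iterator is cheap to sketch:

    #include <cstdint>

    // Walks code points over valid UTF-8 bytes. Names invented; a real
    // iterator would also satisfy the standard iterator requirements.
    class utf8_code_point_iterator {
        const char* pos_;

    public:
        explicit utf8_code_point_iterator(const char* pos) : pos_(pos) {}

        // Decode the code point whose lead byte is at the current position.
        std::uint32_t operator*() const {
            unsigned char b = pos_[0];
            if (b < 0x80) return b;                       // ASCII fast path
            int len = b >= 0xF0 ? 4 : b >= 0xE0 ? 3 : 2;  // from lead byte
            std::uint32_t cp = b & (0x7F >> len);         // lead-byte bits
            for (int i = 1; i < len; ++i)
                cp = (cp << 6) | (pos_[i] & 0x3F);        // continuation bits
            return cp;
        }

        // Forward: skip the lead byte plus its continuation bytes.
        utf8_code_point_iterator& operator++() {
            unsigned char b = *pos_;
            pos_ += b < 0x80 ? 1 : b >= 0xF0 ? 4 : b >= 0xE0 ? 3 : 2;
            return *this;
        }

        // Backward: back up over continuation bytes (10xxxxxx) to a lead.
        utf8_code_point_iterator& operator--() {
            do { --pos_; } while ((*pos_ & 0xC0) == 0x80);
            return *this;
        }

        bool operator==(const utf8_code_point_iterator& o) const
            { return pos_ == o.pos_; }
        bool operator!=(const utf8_code_point_iterator& o) const
            { return pos_ != o.pos_; }
    };

The normalization scheme could ride along as a template parameter (compile
time) or a constructor argument (run time); an abstract-character iterator
would be this plus grouping each starter with its trailing combiners.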
Input needs special handling, since we shouldn't allow ultimately invalid
byte/code-point combinations into Unicode strings. We need something that
can iterate over a byte stream in a particular encoding and spit out whole
code points (or queue the code points and spit out abstract characters).
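A sketch of that input side, pulling bytes from a stream and emitting only
whole, valid code points (invented interface; a real one should also report
*why* a sequence was rejected):

    #include <cstdint>
    #include <istream>

    // Reads one UTF-8 sequence from `in`. Returns true and sets `cp` on
    // success; returns false at end-of-stream or on any invalid input.
    bool next_code_point(std::istream& in, std::uint32_t& cp)
    {
        int lead = in.get();
        if (!in) return false;                            // end of stream

        int len;
        if      (lead < 0x80)           { cp = lead; return true; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else                            return false;     // bad lead byte

        for (int i = 1; i < len; ++i) {
            int b = in.get();
            if (!in || (b & 0xC0) != 0x80) return false;  // truncated/bad
            cp = (cp << 6) | (b & 0x3F);
        }

        static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len])             return false;   // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
        if (cp > 0x10FFFF)                return false;   // beyond Unicode
        return true;
    }

Queueing these code points and grouping each starter with its trailing
combining marks would then give the abstract characters of type [2].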
--
Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com