From: Anthony Williams (anthony_w.geo_at_[hidden])
Date: 2004-10-26 07:43:22

Daryle Walker <darylew_at_[hidden]> writes:

> I think that there should be at most two _external_ Unicode string types:
> 1. vector of Unicode code-points
>    externally each element is like a int32_t
> 2. vector of abstract Unicode characters
>    externally each element is a group of Unicode code-points
>    (a primary or starting code point, followed by combiner codes)

> Some users do care about the outside appearance. (I think some guy here
> wanted UTF-8 XML.) In those cases we have a specific input or output
> routine that uses an appropriate encoding object, to hide whether or not the
> Unicode string internally uses the same encoding as the final source/sink.

> Iterators probably should be made for code points and/or abstract
> characters. Bidirectional travel would be easiest. Such iterators should
> be configured (at compile- and/or run-time) for various normalization
> schemes.

> Input needs special handling, since we shouldn't allow ultimately invalid
> byte/code-point combinations into Unicode strings. We need something that
> can enumerate over a byte stream for a particular encoding and spit out
> whole code-points (or queue the code-points and spit out abstract
> characters).

The XML parser I have under development on Sourceforge includes string
handling facilities that
support the above. I haven't yet found a need for dealing with "abstract
characters", since the closest thing in XML (name matching) requires that the
names use the same sequence of code points (including combining characters) in
all places.

The "vector of Unicode code-points" I use is a
std::basic_string<UnicodeCharacter>, where UnicodeCharacter is a POD struct
with a 32-bit int member to represent the Unicode code point. I have to do it
that way to allow customization of std::char_traits, since you cannot
specialize std::char_traits for built-in types.
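
To make the approach above concrete, here is a minimal sketch of what such a
string type might look like. It is not the code from the parser: the member
name code_point, the UnicodeString alias, the make_string helper, and the
choice of a 64-bit int_type are all assumptions made for illustration, and it
uses C++11 syntax for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <cwchar>
#include <initializer_list>
#include <ios>
#include <string>

// Hypothetical POD wrapper: one 32-bit member holding the code point.
struct UnicodeCharacter {
    std::uint32_t code_point;
};

namespace std {
// Specializing char_traits is permitted here because the specialization
// depends on a user-defined type.
template <>
struct char_traits<UnicodeCharacter> {
    using char_type = UnicodeCharacter;
    using int_type = std::uint64_t;  // assumption: wider than 32 bits so eof() is distinct
    using off_type = std::streamoff;
    using pos_type = std::streampos;
    using state_type = std::mbstate_t;

    static void assign(char_type& dst, const char_type& src) noexcept { dst = src; }
    static bool eq(char_type a, char_type b) noexcept { return a.code_point == b.code_point; }
    static bool lt(char_type a, char_type b) noexcept { return a.code_point < b.code_point; }

    // Lexicographic comparison by code point value.
    static int compare(const char_type* a, const char_type* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
    // Length of a sequence terminated by code point 0.
    static std::size_t length(const char_type* s) {
        std::size_t n = 0;
        while (s[n].code_point != 0) ++n;
        return n;
    }
    static const char_type* find(const char_type* s, std::size_t n, const char_type& c) {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], c)) return s + i;
        return nullptr;
    }
    // The struct is trivially copyable, so the byte-wise routines are safe.
    static char_type* move(char_type* dst, const char_type* src, std::size_t n) {
        if (n != 0) std::memmove(dst, src, n * sizeof(char_type));
        return dst;
    }
    static char_type* copy(char_type* dst, const char_type* src, std::size_t n) {
        if (n != 0) std::memcpy(dst, src, n * sizeof(char_type));
        return dst;
    }
    static char_type* assign(char_type* s, std::size_t n, char_type c) {
        for (std::size_t i = 0; i < n; ++i) s[i] = c;
        return s;
    }

    static int_type eof() noexcept { return ~int_type(0); }
    static int_type not_eof(int_type i) noexcept { return i == eof() ? int_type(0) : i; }
    static char_type to_char_type(int_type i) noexcept {
        return char_type{static_cast<std::uint32_t>(i)};
    }
    static int_type to_int_type(char_type c) noexcept { return c.code_point; }
    static bool eq_int_type(int_type a, int_type b) noexcept { return a == b; }
};
}  // namespace std

using UnicodeString = std::basic_string<UnicodeCharacter>;

// Hypothetical helper: build a string from a list of code point values.
inline UnicodeString make_string(std::initializer_list<std::uint32_t> code_points) {
    UnicodeString s;
    for (std::uint32_t cp : code_points) {
        assert(cp <= 0x10FFFF);  // outside the Unicode code space otherwise
        s.push_back(UnicodeCharacter{cp});
    }
    return s;
}
```

With this sketch, make_string({0x65, 0x301}) ("e" followed by a combining
acute accent) and make_string({0xE9}) (precomposed "é") compare unequal,
because comparison is purely by code point sequence. That is the behaviour
XML name matching needs, as noted above.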


Anthony Williams
Software Developer