From: Erik Wien (wien_at_[hidden])
Date: 2004-10-20 14:21:17


"Peter Dimov" <pdimov_at_[hidden]> wrote in message news:00d501c4b6a2$f541bda0

> That may be so. But I don't see how the user can be isolated from the
> binary representation if he needs to pick one of utf8_string,
> utf16_string, ucs2_string, ucs4_string to store his strings. Perhaps I
> misunderstand your idea. Can you post a sketch of your spec? How many
> string classes do you have? What encoding do they use? What do begin(),
> end(), size() return? Are the iterators random access? Bidirectional?
> Constant? How can the user obtain the underlying element sequence to
> persist it somewhere or to pass it to an external library?

First, you need to understand that what I have so far is just a preliminary
test implementation for my own amusement. I anticipate that a lot of things
will change if I go forward with this project.

Right now I have a single encoded_string class template that has two
template parameters, namely encoding and encoding_traits. encoding_traits is
a class where all encoding-specific implementation is kept, and this class
is used to set up the encoded_string class to correctly represent strings in
the given encoding.
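
To make the shape of that design concrete, here is a minimal sketch of one
way it could look. Only the names encoded_string and encoding_traits come
from the description above; the tag types, the code_unit typedef and the
encoded_length, units_for and unit_count members are assumptions made up for
illustration, not the actual test implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tag types that name an encoding (not from the post).
struct utf8_tag {};
struct utf16_tag {};

// Hypothetical traits template: all encoding-specific knowledge lives here.
template <typename Encoding>
struct encoding_traits;

template <>
struct encoding_traits<utf8_tag> {
    typedef std::uint8_t code_unit;
    // Number of code units needed to store one code point.
    static std::size_t encoded_length(char32_t cp) {
        assert(cp <= 0x10FFFF);
        return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    }
};

template <>
struct encoding_traits<utf16_tag> {
    typedef std::uint16_t code_unit;
    // One unit inside the BMP, a surrogate pair (two units) above it.
    static std::size_t encoded_length(char32_t cp) {
        assert(cp <= 0x10FFFF);
        return cp < 0x10000 ? 1 : 2;
    }
};

// One string class; the traits parameter configures it for an encoding.
template <typename Encoding, typename Traits = encoding_traits<Encoding> >
class encoded_string {
public:
    typedef typename Traits::code_unit code_unit;

    // How many code units this encoding needs for cp.
    static std::size_t units_for(char32_t cp) {
        return Traits::encoded_length(cp);
    }

    // Size of the underlying storage, in code units.
    std::size_t unit_count() const { return units_.size(); }

private:
    std::vector<code_unit> units_;
};
```

With this layout, supporting another encoding would mean adding one more
traits specialization while the string class itself stays unchanged.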

begin() and end() return a code point iterator that has the same interface
and value_type, no matter what the underlying encoding is. That is, you
only see code points when iterating over a string, not the underlying code
unit sequence.
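
As a rough illustration of iterating by code point rather than by code
unit, the sketch below walks a UTF-8 byte sequence and yields one decoded
code point per step. The class name utf8_code_point_iterator and the helper
decode_all are hypothetical, the iterator is read-only, and the code assumes
well-formed UTF-8 with no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical read-only code point iterator over well-formed UTF-8.
class utf8_code_point_iterator {
public:
    typedef std::bidirectional_iterator_tag iterator_category;
    typedef char32_t value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const char32_t* pointer;
    typedef char32_t reference;  // decoded on the fly, so returned by value

    explicit utf8_code_point_iterator(const unsigned char* p) : p_(p) {
        assert(p != nullptr);
    }

    // Decode the code point whose first code unit is at p_.
    char32_t operator*() const {
        const std::size_t n = length(*p_);
        if (n == 1) return *p_;
        char32_t cp = *p_ & (0x7F >> n);  // payload bits of the lead byte
        for (std::size_t i = 1; i < n; ++i) cp = (cp << 6) | (p_[i] & 0x3F);
        return cp;
    }

    // Step forward by the length of the current sequence.
    utf8_code_point_iterator& operator++() {
        p_ += length(*p_);
        return *this;
    }

    // Step backward over continuation bytes (10xxxxxx) to the lead byte.
    utf8_code_point_iterator& operator--() {
        do { --p_; } while ((*p_ & 0xC0) == 0x80);
        return *this;
    }

    bool operator==(const utf8_code_point_iterator& o) const { return p_ == o.p_; }
    bool operator!=(const utf8_code_point_iterator& o) const { return p_ != o.p_; }

private:
    // Sequence length in code units, derived from the lead byte.
    static std::size_t length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0x6) return 2;
        if ((lead >> 4) == 0xE) return 3;
        return 4;
    }

    const unsigned char* p_;
};

// Collect every code point of a UTF-8 string.
std::vector<char32_t> decode_all(const std::string& s) {
    const unsigned char* b = reinterpret_cast<const unsigned char*>(s.data());
    utf8_code_point_iterator it(b), last(b + s.size());
    std::vector<char32_t> out;
    for (; it != last; ++it) out.push_back(*it);
    return out;
}
```

A UTF-16 counterpart would expose the same interface and value_type and
differ only in how it decodes and steps, which is the point of hiding the
code units behind the iterator.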

The iterators used are bidirectional, not random access (constant-time
random access by code point is impossible on UTF-8 and UTF-16, because both
are variable-width), and they are as of now not constant. It IS possible to
assign a code point to a UTF-8 encoded string through an iterator, even if
the resulting code unit sequence would be longer than the one the iterator
is pointing to. The underlying container is automatically resized to make
room for the new sequence. (This is of course slow!)
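
To show why such an assignment is slow, here is a small sketch of replacing
one code point inside a UTF-8 buffer. The function names encode_utf8,
utf8_length and assign_code_point are made up for this example, the buffer
is assumed to hold well-formed UTF-8, and pos is assumed to be the index of
a lead byte. A real iterator would wrap this in a proxy reference; that part
is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Encode one code point as 1 to 4 UTF-8 code units.
std::vector<unsigned char> encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF);
    std::vector<unsigned char> out;
    if (cp < 0x80) {
        out.push_back(static_cast<unsigned char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<unsigned char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<unsigned char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<unsigned char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// Length of the sequence that starts with this lead byte.
std::size_t utf8_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead >> 5) == 0x6) return 2;
    if ((lead >> 4) == 0xE) return 3;
    return 4;
}

// Replace the code point starting at index pos with cp. If the new
// sequence has a different length, every later code unit is shifted,
// which makes the operation linear in the size of the buffer.
void assign_code_point(std::vector<unsigned char>& units,
                       std::size_t pos, char32_t cp) {
    assert(pos < units.size());
    const std::ptrdiff_t old_len =
        static_cast<std::ptrdiff_t>(utf8_length(units[pos]));
    const std::vector<unsigned char> enc = encode_utf8(cp);
    std::vector<unsigned char>::iterator first =
        units.begin() + static_cast<std::ptrdiff_t>(pos);
    std::vector<unsigned char>::iterator at = units.erase(first, first + old_len);
    units.insert(at, enc.begin(), enc.end());
}
```

Replacing the 'b' in "abc" with U+20AC grows the buffer from three code
units to five, and the trailing 'c' has to move to make room.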

