From: Erik Wien (wien_at_[hidden])
Date: 2005-04-04 06:31:04


Sundell Software wrote:
> Each of UTF-8/16/32 has its own iterator type, but all output UTF-32
> when accessed. Look at std::istream_iterator/std::ostream_iterator for
> the design. There would probably be helper functions for the most
> common tasks, and I think you should be able to do all the necessary
> tasks with just iterators.

Yep. That is basically how the current implementation works. It's all
(bi-directional) iterators. A Unicode string is by nature a
bi-directional sequence, so you're basically forced to work that way.
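
To make the "bi-directional by nature" point concrete, here is a minimal
sketch of a read-only iterator over UTF-8 that yields UTF-32 code points.
The name utf8_code_point_iterator is hypothetical and this is not the
library's actual code; it also assumes the input is already valid UTF-8.
Stepping forward or backward only needs to look at the bytes next to the
current position, but reaching the n-th code point means walking there.

```cpp
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical sketch: walks a UTF-8 byte string and yields UTF-32 code
// points. Assumes valid UTF-8; no validation or error handling is done.
class utf8_code_point_iterator {
public:
    using iterator_category = std::bidirectional_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;  // decoded on the fly, so returned by value

    explicit utf8_code_point_iterator(std::string::const_iterator pos)
        : pos_(pos) {}

    // Decode the code point that starts at the current byte.
    char32_t operator*() const {
        auto p = pos_;
        const unsigned char lead = static_cast<unsigned char>(*p);
        if (lead < 0x80) return lead;            // single-byte (ASCII) case
        const int extra = extra_bytes(lead);     // number of 10xxxxxx bytes
        char32_t cp = lead & (0x3F >> extra);    // payload bits of lead byte
        for (int i = 0; i < extra; ++i) {
            ++p;
            cp = (cp << 6) | (static_cast<unsigned char>(*p) & 0x3F);
        }
        return cp;
    }

    // Forward: the lead byte tells us how long the sequence is.
    utf8_code_point_iterator& operator++() {
        pos_ += 1 + extra_bytes(static_cast<unsigned char>(*pos_));
        return *this;
    }
    utf8_code_point_iterator operator++(int) {
        auto old = *this; ++*this; return old;
    }

    // Backward: skip continuation bytes until a lead byte is found.
    utf8_code_point_iterator& operator--() {
        do { --pos_; } while ((static_cast<unsigned char>(*pos_) & 0xC0) == 0x80);
        return *this;
    }
    utf8_code_point_iterator operator--(int) {
        auto old = *this; --*this; return old;
    }

    friend bool operator==(const utf8_code_point_iterator& a,
                           const utf8_code_point_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_code_point_iterator& a,
                           const utf8_code_point_iterator& b) {
        return !(a == b);
    }

private:
    static int extra_bytes(unsigned char lead) {
        return lead < 0x80 ? 0 : lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
    }
    std::string::const_iterator pos_;
};
```

Counting code points with std::distance on a pair of these iterators is
linear in the length of the string, which is the price of a
variable-width encoding.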

> typedef basic_string<utf_8> ustring8;
> typedef basic_string<utf_16> ustring16;
>
> ustring8 u8;
> ustring16 u16;
>
> // Would probably make .begin() the default.
> unicode_iterator i8(u8, u8.begin());
>
> // This would be a slow way of doing operator[]. The assignment would
> // insert/remove elements from the basic_string if necessary.
> // std::advance returns void and needs an lvalue, so name the iterator.
> unicode_iterator i16(u16, u16.begin());
> std::advance(i16, 5);
> *i16 = *(i8++);
>
> Note that the client is responsible for giving a valid iterator to
> unicode_iterator.

An implementation like this is already in place, but not locked to
basic_string. A mutable code_point_iterator (unicode_iterator in your
code) can be created from any random-access sequence. You won't be
getting random access to the Unicode sequence though, as I mentioned
above.
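
As a rough illustration of what assigning through such a mutable
iterator has to do underneath, here is a sketch that replaces one UTF-8
encoded code point inside any container of char that offers erase and
insert (std::string, std::vector<char>, std::deque<char>). The names
encode_utf8, utf8_sequence_length and replace_code_point are hypothetical
and not taken from the implementation; the code assumes valid UTF-8 and
a byte offset that points at a lead byte.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Encode one code point as UTF-8. Assumes cp is a valid Unicode scalar
// value (at most 0x10FFFF, not a surrogate).
inline std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Length in bytes of the UTF-8 sequence that starts with this lead byte.
inline std::size_t utf8_sequence_length(unsigned char lead) {
    return lead < 0x80 ? 1 : lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : 2;
}

// Overwrite the code point that starts at byte offset `pos` with `cp`.
// Returns how much the container grew (positive) or shrank (negative).
template <class Container>
std::ptrdiff_t replace_code_point(Container& bytes, std::size_t pos,
                                  char32_t cp) {
    assert(pos < bytes.size());
    const std::size_t old_len =
        utf8_sequence_length(static_cast<unsigned char>(bytes[pos]));
    const std::string enc = encode_utf8(cp);

    // Remove the old sequence, then insert the new one in its place.
    // Everything after `pos` moves whenever the two lengths differ.
    auto first = bytes.begin() + static_cast<std::ptrdiff_t>(pos);
    first = bytes.erase(first, first + static_cast<std::ptrdiff_t>(old_len));
    bytes.insert(first, enc.begin(), enc.end());

    return static_cast<std::ptrdiff_t>(enc.size()) -
           static_cast<std::ptrdiff_t>(old_len);
}
```

A real iterator would hide this behind a proxy returned from operator*,
and would erase and insert only when the lengths differ; the sketch
keeps the two steps separate to show where the elements move.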

>
> BTW, is using UTF-8/16 in the container really cheaper overall than
> UTF-32? If the client changes a character and the new one happens to
> be larger or smaller, all the elements behind it would need to be
> moved. Does that happen rarely enough? Though the client should
> probably know that themselves.

UTF-8, no. That is for people who require small size above all. But
UTF-16 usually is, unless you are using some obscure script whose
characters fall outside the BMP (Basic Multilingual Plane).
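
The reasoning can be spelled out with a small sketch (the helper names
are made up for illustration). Every BMP code point takes exactly one
16-bit unit in UTF-16, so replacing one BMP character with another never
moves the tail of the string. In UTF-8 the width varies from one to four
bytes, so replacements often do.

```cpp
#include <cstddef>

// Code units needed for one code point in each encoding. Assumes cp is
// a valid Unicode scalar value.
inline std::size_t utf8_units(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}
inline std::size_t utf16_units(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;  // BMP: one unit; beyond: surrogate pair
}
inline std::size_t utf32_units(char32_t) {
    return 1;                     // always fixed width
}

// True if overwriting `old_cp` with `new_cp` changes the encoded length,
// meaning every element after it has to be moved.
inline bool utf8_replace_moves_tail(char32_t old_cp, char32_t new_cp) {
    return utf8_units(old_cp) != utf8_units(new_cp);
}
inline bool utf16_replace_moves_tail(char32_t old_cp, char32_t new_cp) {
    return utf16_units(old_cp) != utf16_units(new_cp);
}
```

For example, replacing 'e' (U+0065) with the euro sign (U+20AC) moves
the tail in UTF-8, where it grows from one byte to three, but not in
UTF-16, where both are a single unit.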

- Erik

