From: loufoque (mathias.gaunard_at_[hidden])
Date: 2006-09-17 21:28:18
Peter Bindels wrote:
> Indexing in UTF32 is trivial. Indexing in UTF16 is fairly trivial, and
> by the definition of the boundary between the base UTF-16 plane and
> the higher plane you should treat all characters >0xFFFF (encoded with
> two entries) as very irregular. You could then keep an array of
> indexes where these characters appear in your string (adding a slight
> bit to the overhead) making overhead constant-time except for the
> occurrences of those characters.
I had already thought of this.
This would allow random access in O(log n), n being the number of
surrogate pairs in the string.
It also allows mutable iterators to not invalidate each other when
modifying the string.
However, for working on the grapheme clusters level, I don't think
random access of the underlying encoding brings anything useful.
Some searching algorithms need random access though, but these could
work on the bytes with a few checks, since all UTF encodings (except
UTF-7) guarantee that a valid sequence cannot occur within a longer
sequence or across the boundary of two other sequences (they are
self-synchronizing).
While combining characters may be rare in some scripts, they are
numerous in others, so indexing them doesn't seem like such a good idea.
It could still be interesting to provide random access for people who
want to work at a lower level, but that's not a priority.
> You cannot add this technique to
> UTF-8 texts because non-7-bit characters are a lot more common.
Another problem with UTF-8 is that reverse iteration is a bit more
expensive than forward iteration, because you don't know how many bytes
to go back until you encounter the lead byte of the multibyte sequence.
UTF-8 isn't usually a good idea unless you need it to feed GTK+ or
Actually, size-wise, another interesting encoding is GB18030, especially
if working with a mix of Chinese and English.
However, there is no direct mapping to Unicode code points, so UTF-16
stays the best choice by default.
> That's a point I hadn't thought of.
That's what the whole "grapheme cluster" thing is about.
<e acute> and <e><acute> should be considered equal, each one being a
single grapheme cluster, what an end-user thinks of as a single character.
Also, searching for foo in foo<acute> shouldn't give any result, since
the last character is o<acute>, not o.
All algorithms must ensure that elements of a grapheme cluster don't get
separated. This can be achieved simply by working with iterators over
them or, in a more optimized way, by working carefully with code units
or code points.
That is why, along with the default grapheme cluster interface, access
will be given to lower-level layers, for the power users.
> In that case, what advantages does
> UTF-32 hold over any of the other two?
First, some people might want to work directly with code points and not
grapheme clusters, just like most (if not all) other Unicode string
libraries allow.
With UTF-32 the code units are code points, so working on them is
straightforward.
Also, it can be useful if you need to interface with something that uses
UTF-32, since that way no conversion is needed for input and output.
That is also why there will be a UTF-8 backend.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk