Subject: Re: [boost] RFC: interest in Unicode codecs?
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2009-02-14 12:07:54


Graham wrote:
> Dear Phil,
>
>>Having said all that, I must say that I actually use the code that I
>>wrote quite rarely. I now tend to use UTF8 everywhere and treat it as
>>a sequence of bytes. Because of the properties of UTF8 I find it's
>>rare to need to identify individual code points.

Note that I said "rare", not "never", and that in the part that you
didn't quote I explained that I do have code to extract code points
from UTF-8 byte sequences.
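
For readers who want to see what extracting a code point involves, here is a minimal sketch. It is not the code referred to above; the name next_code_point is hypothetical, and the function assumes the input is already well-formed UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode the code point that starts at s[i] and
// advance i past it. Assumes well-formed UTF-8; performs no validation.
inline char32_t next_code_point(const std::string& s, std::size_t& i)
{
    unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80)
        return b0;                                   // 0xxxxxxx: one byte

    // Lead byte 110xxxxx, 1110xxxx or 11110xxx: 1, 2 or 3 bytes follow.
    int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
    char32_t cp = b0 & (0x3F >> extra);              // payload bits of the lead byte

    while (extra-- > 0)                              // 6 payload bits per continuation byte
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}
```

Calling it repeatedly with the same index walks a string one code point at a time, which is all that is needed on the rare occasions when byte-level processing is not enough.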

>>For example, if I'm
>>scanning for a matching " or ) I can just look for the next matching
>>byte, without worrying about where the character boundaries are.
>
> Using UTF-8 can work well if you are only targeting American and Western
> Europe for non-literary use.
>
> If you need to support the rest of the world you really need to move to
> UTF-32 due to the large number of characters and the grapheme and glyph
> handling

UTF-8 encodes the same characters as UTF-32. I wonder if you misread
"UTF-8" as "ISO-8859-1"?
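
The byte-scanning point quoted above holds for any script because, in UTF-8, every byte of a multi-byte sequence has its top bit set, so an ASCII byte such as ')' can never appear inside another character. A rough sketch of such a scan follows; the name find_matching_paren is hypothetical, and the handling of nested parentheses is an added assumption rather than something described in this thread.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: given the byte offset of a '(' in a UTF-8 string,
// return the byte offset of the ')' that closes it, or npos if there is none.
// Works on raw bytes: '(' and ')' are ASCII, and ASCII bytes never occur
// inside a multi-byte UTF-8 sequence, so character boundaries do not matter.
inline std::size_t find_matching_paren(const std::string& utf8, std::size_t open)
{
    int depth = 0;
    for (std::size_t i = open; i < utf8.size(); ++i) {
        if (utf8[i] == '(')
            ++depth;
        else if (utf8[i] == ')' && --depth == 0)
            return i;
    }
    return std::string::npos;
}
```

The returned value is a byte offset, which is exactly what substr and the other std::string members expect, so no conversion to or from code point indices is needed.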

> [e.g. in Urdu you can type 3 characters and they are displayed
> as a single combined glyph, and the cursor should never be placed
> between them].

Right. This is a very complex area. But I don't think the choice of
UTF-8 or UTF-32 makes much difference. UTF-32 gives you efficient
random access to individual code points, which UTF-8 does not. UTF-8
will be more compact than UTF-32 in all but the most contrived cases.
Whether compactness or efficiency of random access matters to you will
depend on your application. These are almost the only ways in which
the choice of encoding matters.
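
To make the trade-off concrete, here is a small sketch with hypothetical helper names. Finding the n-th code point in UTF-8 needs a linear scan, whereas in UTF-32 it is a plain array index; on the other hand, UTF-32 always spends four bytes per code point.

```cpp
#include <cstddef>
#include <string>

// True for bytes of the form 10xxxxxx, which continue a multi-byte sequence.
inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Number of code points in a well-formed UTF-8 string: every byte that is
// not a continuation byte starts exactly one code point.
inline std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char b : utf8)
        if (!is_continuation(b))
            ++n;
    return n;
}

// Byte offset of the n-th code point (0-based). This is an O(length) scan;
// the UTF-32 equivalent is simply index n. Returns utf8.size() if n is past
// the end of the string.
inline std::size_t offset_of_code_point(const std::string& utf8, std::size_t n)
{
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        if (is_continuation(static_cast<unsigned char>(utf8[i])))
            continue;
        if (n == 0)
            return i;
        --n;
    }
    return utf8.size();
}
```

For a string holding "a", U+00E9, U+20AC and "z", the UTF-8 form occupies 7 bytes, while the same four code points in UTF-32 occupy 16.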

> Even in UTF-8 things can get a bit tricky. For example, where do you
> break the line if you needed in the middle of:
> joe)jack -> joe) <br> jack
> joe(jack -> joe <br> (jack
> joe+jack -> guess which is the standard !

I don't see how this influences your choice of UTF variant.

> For programmers we don't mind too much, but when you are writing text
> editors this can be really important.
>
> Now think how many characters there are with special rules on whether
> they can be split before, after, or never split, and you start to touch
> on the reason for the Unicode standard and why you need character
> properties.

Yes, a Unicode character properties library is important to those who
are writing text editors and similar applications. Perhaps Boost
should have one. I have personally used the Unicode properties tables
for doing "approximate matching" of e.g. accented characters with their
base characters when searching. But I can do that equally well in
UTF-8 as in UTF-32.
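
As an illustration only (this is not the code I use, and the function names are hypothetical), here is a simplified sketch of approximate matching done directly on UTF-8 bytes. It assumes the text has already been decomposed (NFD), which is the step that really needs the Unicode tables, and it only recognises the Combining Diacritical Marks block U+0300 to U+036F, whose UTF-8 encodings are CC 80 to CC BF and CD 80 to CD AF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: remove combining diacritical marks (U+0300..U+036F)
// from an NFD-normalised UTF-8 string without decoding it to code points.
inline std::string strip_combining_marks(const std::string& nfd_utf8)
{
    std::string out;
    for (std::size_t i = 0; i < nfd_utf8.size(); ++i) {
        unsigned char b0 = static_cast<unsigned char>(nfd_utf8[i]);
        if ((b0 == 0xCC || b0 == 0xCD) && i + 1 < nfd_utf8.size()) {
            unsigned char b1 = static_cast<unsigned char>(nfd_utf8[i + 1]);
            // CC 80..BF is U+0300..U+033F; CD 80..AF is U+0340..U+036F.
            bool mark = (b0 == 0xCC) ? (b1 >= 0x80 && b1 <= 0xBF)
                                     : (b1 >= 0x80 && b1 <= 0xAF);
            if (mark) {
                ++i;            // skip the second byte of the mark as well
                continue;
            }
        }
        out += nfd_utf8[i];
    }
    return out;
}

// Two strings match approximately if they are equal once marks are removed.
inline bool approx_equal(const std::string& a, const std::string& b)
{
    return strip_combining_marks(a) == strip_combining_marks(b);
}
```

With this, "cafe" followed by U+0301 (combining acute accent) compares equal to plain "cafe". A UTF-32 version would test code point ranges instead of byte pairs, but the structure would be the same.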

Regards, Phil.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk