Subject: Re: [boost] RFC: interest in Unicode codecs?
From: Graham (Graham_at_[hidden])
Date: 2009-02-14 05:53:20

Next message: Vladimir Batov: "Re: [boost] boost::string namespace and string-conversion functions"
Previous message: Andrey Semashev: "Re: [boost] Review Request: Introduction of boost::string namespace and string-conversion functions"
Maybe in reply to: Cory Nelson: "[boost] RFC: interest in Unicode codecs?"
Next in thread: Esben Mose Hansen: "Re: [boost] RFC: interest in Unicode codecs?"
Reply: Esben Mose Hansen: "Re: [boost] RFC: interest in Unicode codecs?"
Reply: Phil Endecott: "Re: [boost] RFC: interest in Unicode codecs?"

Dear Phil,

>Having said all that, I must say that I actually use the code that I
>wrote quite rarely. I now tend to use UTF8 everywhere and treat it as
>a sequence of bytes. Because of the properties of UTF8 I find it's
>rare to need to identify individual code points. For example, if I'm
>scanning for a matching " or ) I can just look for the next matching
>byte, without worrying about where the character boundaries are.
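
The property Phil relies on can be sketched in a few lines. This is only an illustration, and find_ascii_delim is a hypothetical helper, not part of any Boost library. It assumes the delimiter is an ASCII character: every byte of a multi-byte UTF-8 sequence has its high bit set, so a plain byte search can never match in the middle of a character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a Boost API): find the next ASCII delimiter in a
// UTF-8 string by comparing raw bytes, without decoding code points.
// This is safe because lead and continuation bytes of multi-byte UTF-8
// sequences are all >= 0x80 and so can never equal an ASCII byte.
std::size_t find_ascii_delim(const std::string& utf8, char delim,
                             std::size_t from = 0) {
    // Assumption of this sketch: the delimiter itself must be ASCII.
    assert(static_cast<unsigned char>(delim) < 0x80);
    return utf8.find(delim, from);  // std::string::npos if not found
}
```

The returned index is a byte offset, not a character count, which is all that is needed for slicing the string at the delimiter.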

Using UTF-8 can work well if you are only targeting America and Western
Europe for non-literary use.

If you need to support the rest of the world you really need to move to
UTF-32 due to the large number of characters and the grapheme and glyph
handling [e.g. in Urdu you can type 3 characters and they are displayed
as a single combined glyph, and the cursor should never be placed
between them].
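
As a rough illustration of the cursor problem, here is a sketch that computes the allowed cursor positions in a UTF-32 string. It is a deliberate simplification: is_combining_mark and cursor_stops are hypothetical names, only two ranges of combining marks are checked (U+0300..U+036F and U+064B..U+065F), and real grapheme cluster rules are considerably richer than this.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified assumption: only two blocks of combining marks are recognised.
// A real implementation would consult the Unicode character database.
bool is_combining_mark(char32_t c) {
    return (c >= 0x0300 && c <= 0x036F)    // Combining Diacritical Marks
        || (c >= 0x064B && c <= 0x065F);   // Arabic vowel signs and marks
}

// Hypothetical helper: return the code point indices where a cursor may
// be placed, never between a base character and the marks attached to it.
std::vector<std::size_t> cursor_stops(const std::u32string& text) {
    std::vector<std::size_t> stops;
    stops.push_back(0);  // the start of the text is always a valid stop
    for (std::size_t i = 1; i < text.size(); ++i) {
        if (!is_combining_mark(text[i])) {
            stops.push_back(i);  // a new base character starts here
        }
    }
    if (!text.empty()) {
        stops.push_back(text.size());  // the end of the text
    }
    return stops;
}
```

For the four code points U+0628 U+064E U+0651 U+0627 this yields the stops 0, 3 and 4: the first three code points are treated as one unit.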

Even in UTF-8 things can get a bit tricky. For example, where do you
break the line if you need to break it in the middle of:
joe)jack -> joe) <br> jack
joe(jack -> joe <br> (jack
joe+jack -> guess which is the standard!

As programmers we don't mind too much, but when you are writing a text
editor this can be really important.

Now think how many characters there are with special rules on whether
they can be split before, after, or never split, and you start to touch
on the reason for the Unicode standard and why you need character
properties.
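
To make the idea of a per-character break property concrete, here is a toy sketch covering only the two bracket cases shown above. BreakRule, break_rule and break_at are hypothetical names, and the table is an assumption for illustration; the Unicode Line Breaking Algorithm (UAX #14) defines many more classes and pair rules. The '+' case is intentionally left as Unknown.

```cpp
#include <cstddef>
#include <string>

// Toy break property: may a line break go before or after this character?
enum class BreakRule { Before, After, Unknown };

// Hypothetical two-entry property table matching the bracket examples.
BreakRule break_rule(char c) {
    switch (c) {
        case '(': return BreakRule::Before;  // opening bracket stays with what follows
        case ')': return BreakRule::After;   // closing bracket stays with what precedes
        default:  return BreakRule::Unknown; // everything else is undecided here
    }
}

// Insert a newline next to the character at index pos, on the side its
// property allows. Text is returned unchanged when the rule is Unknown.
// Throws std::out_of_range if pos is not a valid index (via at()).
std::string break_at(const std::string& text, std::size_t pos) {
    std::size_t cut = 0;
    switch (break_rule(text.at(pos))) {
        case BreakRule::Before: cut = pos;     break;
        case BreakRule::After:  cut = pos + 1; break;
        default:                return text;
    }
    return text.substr(0, cut) + "\n" + text.substr(cut);
}
```

With this table, break_at("joe)jack", 3) gives "joe)" and "jack" on separate lines, while break_at("joe(jack", 3) gives "joe" and "(jack".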

Yours,

Graham

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk