Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Lassi Tuura (lat_at_[hidden])
Date: 2011-01-20 04:17:45


Hi,

> The OED lists ~600,000 words, so 32 bits is enough space to provide a
> fully pictographic alphabet for over 7,000 languages as rich as English,
> with room for a few line-drawing characters left over. Surely that's enough?

It could be. Depends on what problems you are trying to solve.

Languages in the world operate in many interestingly different ways. Enabling computers to input, store, display, typeset, hyphenate, search, spell check, render to speech, and perform other multi-lingual text tasks sometimes involves rules more complex than those used for English.

The Unicode Consortium (unicode.org) provides lots of excellent material on these issues, including FAQs. If you are genuinely interested in solving text processing issues for the entire world, I highly recommend a visit over there.

Not all software needs to care about those problems. For a library, one needs to decide which set of tasks and languages to support. If the target is all text processing tasks for the entire world, one may end up with strange-seeming ideas, such as a variable number of code units per character, or treating random access to strings as a lower priority.
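To make the "variable number of code units" point concrete, here is a minimal sketch in C++, assuming the text is well-formed UTF-8 held in a std::string. The helper names (utf8_seq_len, count_code_points, code_point_offset) are made up for this illustration and do not come from any existing library. It shows that the byte count and the code point count differ, and that finding the n-th code point takes a linear scan rather than an index operation.

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence introduced by lead byte b.
// Assumption: input is well-formed UTF-8. Continuation or invalid
// lead bytes yield 1 so that a scan always makes progress.
inline std::size_t utf8_seq_len(unsigned char b) {
    if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((b >> 5) == 0x06) return 2;    // 110xxxxx
    if ((b >> 4) == 0x0E) return 3;    // 1110xxxx
    if ((b >> 3) == 0x1E) return 4;    // 11110xxx
    return 1;
}

// Number of code points in s: one per sequence, found by walking the bytes.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size();
         i += utf8_seq_len(static_cast<unsigned char>(s[i]))) {
        ++n;
    }
    return n;
}

// Byte offset of the n-th code point (0-based), or s.size() if s has
// fewer than n + 1 code points. This is O(n): there is no way to jump
// straight to the answer without scanning the preceding sequences.
inline std::size_t code_point_offset(const std::string& s, std::size_t n) {
    std::size_t i = 0;
    while (n > 0 && i < s.size()) {
        i += utf8_seq_len(static_cast<unsigned char>(s[i]));
        --n;
    }
    return i < s.size() ? i : s.size();
}
```

For the string "a\xC3\xA4\xE2\x82\xAC" (the letters a and ä followed by the euro sign), size() reports 6 bytes while count_code_points reports 3 code points. Code points are still not the same thing as what a user perceives as one character.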

Then there are constraints. A library that appears to double an application's memory use for no good reason may earn its designers a seriously bad reputation. Refusing to, say, display all files in a directory may get users upset, even if the filenames aren't valid by some standard or another.

Of course, when there are other goals - perhaps the software needs to handle any text but treats it as an opaque blob, or perhaps the author values beauty of internal design more than supporting languages in far-flung corners of the world, or the app is such that butchering the names of 50% of the world's population will have no dire consequences - one will likely end up with a different design.

To give you a taste of some of the complex issues, here are a few quotes from the South Asian Scripts chapter of the Unicode Standard, http://www.unicode.org/versions/Unicode5.0.0/ch09.pdf:

> The writing systems that employ Devanagari and other Indic scripts constitute abugidas -- a cross between syllabic writing systems and alphabetic writing systems. The effective unit of these writing systems is the orthographic syllable, consisting of a consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a canonical structure of (((C)C)C)V. [...] Devanagari characters, like characters from many other scripts, can combine or change shape depending on their context. [...] Additionally, a few Devanagari characters cause a change in the order of the displayed characters. [...] Some Devanagari consonant letters have alternative presentation forms whose choice depends on neighboring consonants. [...] Devanagari has a collection of nonspacing dependent vowel signs that may appear above or below a consonant letter, as well as spacing dependent vowel signs that may occur to the right or to the left of a consonant letter or consonant cluster. [...] If the superscript mark RAsup is to be applied to a dead consonant that is subsequently replaced by its half-consonant form, then the mark is positioned so that it applies to the form that serves as the base of the consonant cluster. [...]
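As a rough illustration of the (((C)C)C)V structure described in the quote, the following C++ sketch groups Devanagari code points into orthographic syllables. It is deliberately simplified and is not the Unicode text segmentation algorithm: the function names are invented for this example, and it ignores nukta, independent vowels, anusvara and other signs, and ZWJ/ZWNJ. It only recognises a consonant, any number of virama-plus-consonant pairs, and an optional trailing virama or dependent vowel sign.

```cpp
#include <cstddef>
#include <string>

// Simplified character classes for the Devanagari block.
// Assumption: only the basic consonants, the virama, and the
// dependent vowel signs are classified; everything else is "other".
inline bool is_consonant(char32_t c)  { return c >= 0x0915 && c <= 0x0939; }
inline bool is_virama(char32_t c)     { return c == 0x094D; }
inline bool is_vowel_sign(char32_t c) { return c >= 0x093E && c <= 0x094C; }

// Count units of the form  C (virama C)* [virama | vowel sign].
// Any code point that does not start with a consonant is counted
// as a unit of its own.
inline std::size_t count_syllables(const std::u32string& s) {
    std::size_t n = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        ++n;
        if (!is_consonant(s[i])) { ++i; continue; }
        ++i;
        // Absorb conjuncts: a virama followed by another consonant.
        while (i + 1 < s.size() && is_virama(s[i]) && is_consonant(s[i + 1])) {
            i += 2;
        }
        // A final virama (dead consonant) or a dependent vowel sign
        // still belongs to the same unit.
        if (i < s.size() && (is_virama(s[i]) || is_vowel_sign(s[i]))) {
            ++i;
        }
    }
    return n;
}
```

With this sketch, the sequence U+0938 U+094D U+0924 U+094D U+0930 U+0940 (SA, virama, TA, virama, RA, vowel sign II) is six code points but one unit, so even a fixed-width 32-bit encoding does not give random access to what a reader would call one syllable.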

You might want to also read:

http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html
http://blog.mozilla.com/dmandelin/2008/02/14/wtf-16/

Regards,
Lassi


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk