Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2011-01-20 08:18:51
On 20/01/2011 09:41, bernardH wrote:
> Dave Abrahams<dave<at> boostpro.com> writes:
>
>>
>> At Wed, 19 Jan 2011 23:25:34 +0000,
>> Brent Spillner wrote:
>>>
>>> On 1/19/2011 11:33 AM, Peter Dimov wrote:
>>>> This was the prevailing thinking once. First this number of bits was 16,
>>>> which incorrect assumption claimed Microsoft and Java as victims, then
>>>> it became 21 (or 22?). Eventually, people realized that this will never
>>>> happen even if we allocate 32 bits per character, so here we are.
>>>
>>> The OED lists ~600,000 words, so 32 bits is enough space to provide a
>>> fully pictographic alphabet for over 7,000 languages as rich as English,
>>> with room for a few line-drawing characters left over. Surely that's enough?
>>
>> Even if it's theoretically possible, the best standards organization
>> the world has come up with for addressing these issues was unable to
>> produce a standard that did it.
>
> I must confess a lack of knowledge wrt to encodings, but my understanding
> is that strings are sequences of some raw data (without semantic),
> code points and glyphs.
The difference between graphemes and glyphs is the main reason for the
complications of dealing with text on computers.
A grapheme is the unit of natural text, while glyphs are the units used
for its graphical representation.
Different glyphs can represent the same grapheme (this is usually
considered a typeface difference, albeit some typefaces support multiple
glyphs for the same grapheme).
A grapheme can be represented by several glyphs (mostly diacritics).
A single glyph can represent several graphemes, with ligatures, albeit
some consider this a typeface quirk and not really a glyph, since a
glyph should be at most one grapheme.
Unicode mostly tries to encode graphemes (it doesn't encode all
variations of 'a', for example, nor all graphic variations of CJK
characters), but for historical reasons the whole thing is quite a mess.
A code point is therefore an element in the Unicode mapping, whose
semantics depend on what that element actually is. It can be a ligature,
a diacritic, a code that is semantically equivalent to another but not
necessarily functionally equivalent, etc.
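
As a rough illustration of that last point, here is a minimal C++11
sketch. It is not taken from any library; the function name is made up
for this example, and the code points were picked only because they are
well known. It shows one grapheme written as two different code point
sequences, plus a ligature that has its own code point.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, for illustration only.
void check_code_point_examples() {
    // The grapheme "e with acute" as one precomposed code point, U+00E9.
    const std::u32string precomposed = U"\u00E9";
    // The same grapheme as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    const std::u32string decomposed = U"e\u0301";

    assert(precomposed.size() == 1);
    assert(decomposed.size() == 2);
    // Equivalent as text, but a plain code point comparison says otherwise.
    assert(precomposed != decomposed);

    // U+FB01 LATIN SMALL LIGATURE FI: one code point for two letters.
    const std::u32string ligature = U"\uFB01";
    assert(ligature.size() == 1);
}
```

Comparing such strings meaningfully would require normalizing them
first (NFC or NFD, for instance), which this sketch deliberately leaves
out.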
The UTF-X encodings then describe how code points are encoded as a
sequence of X-bit code units.
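
To make the code point / code unit distinction concrete, here is a
minimal sketch of encoding a single code point as UTF-8 and as UTF-16.
The function names encode_utf8 and encode_utf16 are made up for this
example, and the only input validation is an assert. It is meant to
show the mechanics, not to serve as a production encoder.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// True for code points that may be encoded: at most U+10FFFF, no surrogates.
inline bool is_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Encode one code point as 1 to 4 UTF-8 code units (8 bits each).
std::vector<std::uint8_t> encode_utf8(char32_t cp) {
    assert(is_scalar_value(cp));
    std::vector<std::uint8_t> out;
    if (cp < 0x80) {
        out.push_back(static_cast<std::uint8_t>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// Encode one code point as 1 or 2 UTF-16 code units (16 bits each).
std::u16string encode_utf16(char32_t cp) {
    assert(is_scalar_value(cp));
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Code points above U+FFFF are split into a surrogate pair.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

With these definitions, U+00E9 becomes the two UTF-8 code units C3 A9
but a single UTF-16 code unit 00E9, while U+1D11E becomes four UTF-8
code units (F0 9D 84 9E) and two UTF-16 code units (D834 DD1E). In
UTF-32 every code point is exactly one code unit.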