Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Edward Diener (eldiener_at_[hidden])
Date: 2011-01-19 20:43:09


On 1/19/2011 6:25 PM, Brent Spillner wrote:
> On 1/19/2011 11:33 AM, Peter Dimov wrote:
>> This was the prevailing thinking once. First this number of bits was 16,
>> which incorrect assumption claimed Microsoft and Java as victims, then
>> it became 21 (or 22?). Eventually, people realized that this will never
>> happen even if we allocate 32 bits per character, so here we are.
>
> The OED lists ~600,000 words, so 32 bits is enough space to provide a
> fully pictographic alphabet for over 7,000 languages as rich as English,
> with room for a few line-drawing characters left over. Surely that's enough?

It is technically enough. In fact Unicode only defines code points in
the range 0 to 0x10FFFF (0x110000 values in all), so a valid UTF-32
value will never exceed 0x10FFFF. UTF-32 can therefore easily handle
every code point in Unicode.
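
To put numbers on the code space, here is a minimal C++11 sketch of the arithmetic. The names kMaxCodePoint, kCodeSpaceSize, bits_needed and is_code_point are made up for illustration and do not come from any library.

```cpp
#include <cstdint>

// Upper bound of the Unicode code space (hypothetical constant name).
constexpr std::uint32_t kMaxCodePoint = 0x10FFFF;

// Number of code points in the range 0..0x10FFFF inclusive.
constexpr std::uint32_t kCodeSpaceSize = kMaxCodePoint + 1;

// Smallest number of bits able to represent `value`
// (recursive so that it remains a valid C++11 constexpr function).
constexpr unsigned bits_needed(std::uint32_t value) {
    return value == 0 ? 0u : 1u + bits_needed(value >> 1);
}

// True if `v` lies inside the Unicode code space.
constexpr bool is_code_point(std::uint32_t v) {
    return v <= kMaxCodePoint;
}

static_assert(bits_needed(kMaxCodePoint) == 21,
              "the largest code point fits in 21 bits");
static_assert(kCodeSpaceSize == 0x110000,
              "0x110000 code points in total");
```

So a 32-bit unit has 11 bits to spare for any single code point, which is why the size of the code space is not the real constraint.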

But Unicode has the idea of an abstract character, which may be
represented by more than one code point. Whether an abstract character
is always considered a single character, or an amalgam of a single
character (code point) and various formatting/graphical code points,
is probably debatable. But if one assumes that an abstract character is
a single "character" in some encoding, then the way that Unicode has
mapped out abstract characters allows for that "character" to be larger
than what will fit into a single UTF-32 code unit.
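
As a rough illustration of that last point, the sketch below stores "e" followed by U+0301 COMBINING ACUTE ACCENT in a std::u32string. The helper names are hypothetical, and the combining-mark test is a deliberate simplification: it covers only the Combining Diacritical Marks block (U+0300 to U+036F), whereas real segmentation would follow the rules in Unicode Standard Annex #29.

```cpp
#include <cstddef>
#include <string>

// Simplified check (an assumption for this sketch): only the
// Combining Diacritical Marks block, U+0300..U+036F, is recognised.
bool is_combining_diacritic(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// Counts code points that are not combining diacritics, as a crude
// stand-in for the number of user-perceived characters.
std::size_t approx_character_count(const std::u32string& s) {
    std::size_t count = 0;
    for (char32_t c : s) {
        if (!is_combining_diacritic(c)) {
            ++count;
        }
    }
    return count;
}

// LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT:
// one "character" to a reader, two UTF-32 code units in memory.
std::u32string sample_decomposed() {
    return U"e\u0301";
}
```

Here sample_decomposed().size() is 2 while approx_character_count(sample_decomposed()) is 1, so even with UTF-32 the number of stored units and the number of "characters" can differ.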


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk