Subject: Re: [boost] [string] proposal
From: Beman Dawes (bdawes_at_[hidden])
Date: 2011-01-22 17:10:10


On Fri, Jan 21, 2011 at 8:47 PM, Patrick Horgan <phorgan1_at_[hidden]> wrote:
> On 01/21/2011 09:50 AM, Beman Dawes wrote:
>>
>> ... elision by patrick ....
>>
>> IMO, any serious Unicode string proposal has to address UTF-8 strings,
>> UTF-16 strings, UTF-32 strings, and probably UTF strings where the
>> particular UTF encoding is established at runtime. Applications that
>> deal with Asian languages, do a lot of random access, or would pay a
>> performance or storage penalty will demand more than just UTF-8
>> strings. There might be other variants, too, such as a BMP-string. If
>> a Unicode string library provides a strong design framework that is
>> clearly articulated, then an initial implementation would only have to
>> provide the most needed types: UTF-8 and UTF-16/BMP.
>>
>> I really doubt any proposal will get taken very seriously if it only
>> supports one of the UTF encodings.
>
> +1 with the caveat that UTF-8 and UTF-32 are considered by many to be the
> most needed types, with UTF-16 considered evil.  (Seems to be a
> Windows/non-Windows split.  I like them all ;)

IIRC, Oracle supports UTF-8 and UTF-16, so a lot of folks will want
UTF-16 for that reason. It isn't just Windows programmers.

>  So all three (four if you
> want to differentiate between fixed-width UTF-16/BMP (really UCS-2) and the
> full UTF-16) would be needed to avoid people saying that it doesn't fill
> their needs so why did we bother.

Yep.
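
To make the fixed-width UCS-2 versus full UTF-16 distinction concrete, here is
a minimal sketch. The function names are hypothetical and are not part of any
proposed Boost interface, and the input is assumed to be well-formed UTF-16.
A string that contains no surrogate code units is also valid UCS-2, so its
code unit count equals its code point count; once surrogate pairs appear, the
two counts diverge and constant-time indexing by code point is lost.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in a UTF-16 string.
// Assumes well-formed input. A high surrogate (0xD800..0xDBFF) followed by
// a low surrogate (0xDC00..0xDFFF) encodes a single code point.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        const bool high = (u >= 0xD800 && u <= 0xDBFF);
        if (high && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;  // skip the low surrogate; the pair counts once
        }
        ++n;
    }
    return n;
}

// Hypothetical helper: true if the string has no surrogate code units,
// i.e. every character is in the BMP and the data is also valid UCS-2.
bool is_bmp_only(const std::u16string& s) {
    for (char16_t u : s) {
        if (u >= 0xD800 && u <= 0xDFFF) return false;
    }
    return true;
}
```

For BMP-only text the two helpers agree with `s.size()`; for text outside the
BMP, `count_code_points` returns fewer code points than there are code units.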

> The UTF string with the encoding chosen at run time would carry
> a lot of extra code.  Wouldn't a programmer know at compile time which
> one he wanted to use internally?

Maybe. But I've written geographic libraries that have to be efficient
for North American, European, and Asian languages alike. It used to be
that we knew at compile time what languages we would be dealing with.
But more and more, because of the internet, the libraries just have to
work well everywhere. The cost of the extra code is swamped by the other
costs involved.

That said, such a string is far lower priority than the others.
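
The compile-time versus run-time trade-off discussed above can be sketched
roughly as follows. Every name here (`utf_encoding`, `basic_utf_string`,
`runtime_utf_string`) is hypothetical and chosen only for illustration; it is
not a design anyone in this thread has proposed. The compile-time variant
fixes the code unit type in the template argument, while the run-time variant
stores raw bytes plus a tag and has to branch on that tag, which is the
"extra code" in question.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical tag naming the three UTF encoding forms.
enum class utf_encoding { utf8, utf16, utf32 };

// Hypothetical compile-time variant: the encoding is implied by CodeUnit,
// so no run-time dispatch is needed.
template <typename CodeUnit>
struct basic_utf_string {
    std::basic_string<CodeUnit> units;
    std::size_t code_unit_count() const { return units.size(); }
};

// Hypothetical run-time variant: raw bytes plus an encoding tag.
class runtime_utf_string {
public:
    runtime_utf_string(utf_encoding e, std::string bytes)
        : encoding_(e), bytes_(std::move(bytes)) {}

    utf_encoding encoding() const { return encoding_; }

    // Size in bytes of one code unit for the stored encoding.
    std::size_t code_unit_size() const {
        switch (encoding_) {
            case utf_encoding::utf8:  return 1;
            case utf_encoding::utf16: return 2;
            case utf_encoding::utf32: return 4;
        }
        return 1;  // not reached for valid enumerator values
    }

    // Every operation pays for a branch on the tag like this one.
    std::size_t code_unit_count() const {
        return bytes_.size() / code_unit_size();
    }

private:
    utf_encoding encoding_;
    std::string bytes_;
};
```

A caller that learns the encoding only from its input (a file header or a
database column, say) could construct a `runtime_utf_string`, while code with
a fixed internal encoding would use `basic_utf_string<char16_t>` or similar
and avoid the dispatch entirely.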

> p.s. Nice quick description of the differences between, and history of,
> UCS-2, UCS-4, UTF-8, UTF-16, and UTF-32 at
> http://en.wikipedia.org/wiki/Universal_Character_Set

Yep, recommended!

Thanks,

--Beman

