Subject: Re: [boost] [locale] Formal review of Boost.Locale library EXTENDED
From: Ryou Ezoe (boostcpp_at_[hidden])
Date: 2011-04-20 03:58:22


Why do some people think one encoding of UCS is better than the others?

On Wed, Apr 20, 2011 at 4:47 PM, Matus Chochlik <chochlik_at_[hidden]> wrote:
> On Wed, Apr 20, 2011 at 3:54 AM, Edward Diener <eldiener_at_[hidden]> wrote:
>> On 4/19/2011 9:05 AM, Artyom wrote:
>>>>
>>>> From: Edward Diener<eldiener_at_[hidden]>
>>>> On 4/19/2011 3:17 AM, Matus Chochlik wrote:
>>>>>
> [snip/]
>>>>>
>>>>> Take your pick :-)
>>>>
>>>> My pick is to use what the language currently provides,
>>>> which is wchar_t, which can represent UTF-16,
>>>
>>> No, it cannot represent UTF-16; it represents
>>> either UTF-32 (on most platforms around) or UTF-16
>>> (on one specific platform, Microsoft Windows).
>>
>> Then clearly it can represent UTF-16.
>>
>>>
>>>> a popular Unicode variant which also happens to be the standard for wide
>>>> characters on Windows, which just happens to be the dominant operating
>>>> system in the world (by a lot) in terms of end-users.
>
> *Only* on a single platform (which you claim to be the most dominant,
> which is only partially true). On other platforms wchar_t
> does not represent UTF-16. Actually the situation with wchar_t is only
> slightly better than with char because just as the standard does not
> specify what encoding the char-based strings use, it also does not
> specify the encoding for wchar_t.
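
To make the point above concrete, here is a minimal sketch (not code from
the thread; the helper names wchar_bits and units_for_non_bmp are made up
for illustration). It shows that both the width of wchar_t and the number
of code units a wide literal occupies depend on the implementation. The
values mentioned in the comments are what is commonly observed, not
something the standard guarantees.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Width of one wchar_t in bits. The standard leaves this to the
// implementation; 16 (Windows) and 32 (most Unix-like systems) are the
// values commonly seen.
inline std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Number of wchar_t units the compiler emits for U+1D11E (MUSICAL SYMBOL
// G CLEF), a character outside the Basic Multilingual Plane.
// With a 16-bit wchar_t this is typically 2 (a surrogate pair),
// with a 32-bit wchar_t it is typically 1.
inline std::size_t units_for_non_bmp() {
    const std::wstring clef = L"\U0001D11E";
    return clef.size();
}
```

Code that assumes "one wchar_t is one character" therefore behaves
differently depending on where it is compiled.
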
>
> And wchar_t using UTF-16 on Windows is not a standard; it is a custom.
> I still remember the times when wchar_t used to be UCS-2.
>
> [snip/]
>>>
>>> I beg your pardon?
>>
>> Apologies. I should not have said that you have a closed mind about this
>> issue.
>>
>>>
>>> UTF-8 is a standard far beyond what
>>> Linux uses.
>>
>> A standard for what? There are three Unicode character sets which are
>> generally used: UTF-8, UTF-16, and UTF-32. I cannot understand why you
>> think one of them is some sort of standard for something.
>
> Then look which of those encodings is "dominant" on the Web (HTML pages,
> PHP scripts, template files for various CMS', CSS files, WSDL files, ...),
> in various database systems (most of those adopting wchar_t and UTF-16
> (UCS-2 really) had quite a lot of problems because of this, just as Windows
> had at some point) or in XML files in general which are used basically
> everywhere.
>
> Just have a look at what encoding is used by the XML files that are zipped
> inside the *Microsoft* Office documents (docx, xlsx, pptx, etc.). Hint: no,
> it's not UTF-16.
>
> But the most important reason why I think that UTF-8 is superior
> to UTF-16/32 is that it is the only truly portable format. Yes, you can
> use UTF-16 or UTF-32 as your internal representation of characters
> on a certain platform, but if you expect to publish the data and move it
> to other computers then the only rational thing to do is to use UTF-8,
> where you don't have to deal with *stupid* byte ordering marks
> nor any other similar nonsense.
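
The byte-order point above can be illustrated with a small sketch (not code
from the thread; to_utf8 and utf16_bytes are hypothetical helper names, and
to_utf8 assumes a valid code point and does no validation). The same UTF-16
code unit serializes to two different byte sequences depending on
endianness, which is why a byte order mark or an out-of-band label is
needed, while UTF-8 is defined as a byte sequence and has only one form.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
// Sketch only: assumes cp is a valid scalar value (no surrogates,
// nothing above U+10FFFF).
inline std::string to_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Serialize one UTF-16 code unit as two bytes in the requested byte order.
inline std::string utf16_bytes(std::uint16_t unit, bool big_endian) {
    const char hi = static_cast<char>(unit >> 8);
    const char lo = static_cast<char>(unit & 0xFF);
    std::string out;
    out += big_endian ? hi : lo;
    out += big_endian ? lo : hi;
    return out;
}
```

For U+20AC (the euro sign), utf16_bytes yields 20 AC in big-endian order and
AC 20 in little-endian order, whereas to_utf8 yields E2 82 AC on every
machine.
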
>
> I think that it is about time that we pick a single character set and a
> single encoding because if we don't we are back at the point where we
> started 30-40 years ago. We just won't have ISO-8859-X, CP-XYZ, ... but
> UTF-XY, UCS-N, ... instead.
>
> And actually I think that the usually highly overrated "invisible
> hand of the market" has done the right thing this time and already
> picked the "best" encoding for us (see above).
> Even Microsoft will wake up one of these days and accept this.
> The fact that they already use UTF-8 in their own document formats
> is IMO proof of that.
>
>>>
>>> All systems use English as the base as
>>> it is the best practice.
>>
>> That's a pretty bald statement but I do not think it is true. But even if it
>> were, why should everybody doing something one way be proof that that way is
>> best ? I would much rather pursue a technical solution that I felt was best
>> even if no one else thought so.
>
> I do not think that it is bald (nor bold ;-)). Using the basic character
> set and English has one big advantage: You won't have problems
> with Unicode normalization. But, for people willing to take the risks
> of their code being unportable, etc. I don't see why we could not
> add another overload for translate which would accept wchar_t-based
> strings, somehow "detect" the encoding and convert it to UTF-8
> for the backend (gettext in this case) if necessary and return
> a wide string with the translation.
> I don't mind if other people want to risk shooting themselves
> in the foot if that is their own free decision :-) This will be temporary
> anyway because the UTF-8 literals are already coming.
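
The overload described above could look roughly like the following sketch.
This is NOT the Boost.Locale API: the namespace sketch and the functions
lookup_utf8, wide_to_utf8, utf8_to_wide and translate are all hypothetical.
lookup_utf8 is a stand-in for the gettext backend and simply returns the
message id, as gettext does when no translation is found. The "detection"
is only a guess based on sizeof(wchar_t): 2 bytes is treated as UTF-16,
anything else as UTF-32, and well-formed input is assumed throughout.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

namespace sketch {  // hypothetical namespace, not part of Boost.Locale

// Append one code point as UTF-8 (no validation; sketch only).
inline void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Wide -> UTF-8. Surrogate pairs are combined only when wchar_t is 2 bytes.
inline std::string wide_to_utf8(const std::wstring& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(in[i]);
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF &&
            i + 1 < in.size()) {
            const std::uint32_t lo = static_cast<std::uint32_t>(in[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        append_utf8(out, cp);
    }
    return out;
}

// UTF-8 -> wide. Assumes well-formed UTF-8 input.
inline std::wstring utf8_to_wide(const std::string& in) {
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i++]);
        std::uint32_t cp = b;
        std::size_t extra = 0;  // number of continuation bytes to read
        if (b >= 0xF0)      { cp = b & 0x07; extra = 3; }
        else if (b >= 0xE0) { cp = b & 0x0F; extra = 2; }
        else if (b >= 0xC0) { cp = b & 0x1F; extra = 1; }
        for (; extra > 0 && i < in.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        if (sizeof(wchar_t) == 2 && cp >= 0x10000) {
            cp -= 0x10000;  // split into a UTF-16 surrogate pair
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<wchar_t>(cp);
        }
    }
    return out;
}

// Stand-in for the UTF-8 backend: returns the message id unchanged.
inline std::string lookup_utf8(const std::string& msgid) { return msgid; }

// The wide overload: convert to UTF-8, ask the backend, convert back.
inline std::wstring translate(const std::wstring& msgid) {
    return utf8_to_wide(lookup_utf8(wide_to_utf8(msgid)));
}

}  // namespace sketch
```

The backend only ever sees UTF-8 here; the wide strings exist purely at the
edges, and each call pays for two conversions.
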
>
> [snip/]
>
> Matus

-- 
Ryou Ezoe