
Subject: Re: [boost] [locale] Formal review of Boost.Locale library EXTENDED
From: Matus Chochlik (chochlik_at_[hidden])
Date: 2011-04-20 03:47:04


On Wed, Apr 20, 2011 at 3:54 AM, Edward Diener <eldiener_at_[hidden]> wrote:
> On 4/19/2011 9:05 AM, Artyom wrote:
>>>
>>> From: Edward Diener<eldiener_at_[hidden]>
>>> On 4/19/2011 3:17 AM, Matus Chochlik wrote:
>>>>
[snip/]
>>>>
>>>> Take your pick :-)
>>>
>>> My pick is to use what the language currently provides,
>>> which is wchar_t, which can represent UTF-16,
>>
>> No, it cannot represent UTF-16; it represents
>> either UTF-32 (on most platforms around) or UTF-16
>> (on one specific platform: Microsoft Windows).
>
> Then clearly it can represent UTF-16.
>
>>
>>> a popular Unicode encoding which also happens to be the standard for wide
>>> characters on Windows, which just happens to be the dominant operating
>>> system in the world (by a lot) in terms of end-users.

*Only* on a single platform (which you claim to be the most dominant,
which is only partially true). On other platforms wchar_t
does not represent UTF-16. Actually, the situation with wchar_t is only
slightly better than with char, because just as the standard does not
specify what encoding char-based strings use, it also does not
specify the encoding for wchar_t.

And wchar_t using UTF-16 on Windows is not a standard; it is a custom.
I still remember the times when wchar_t used to be UCS-2.
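
A minimal illustration of this point (the printed values noted in the
comments are the typical ones, not anything guaranteed by the standard):

    #include <iostream>

    int main() {
        // The C++ standard fixes neither the size nor the encoding of
        // wchar_t; both are implementation-defined.
        std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
        // Typically prints 2 on Windows (UTF-16/UCS-2 wide strings)
        // and 4 on most Unix-like platforms (usually UTF-32).
    }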

[snip/]
>>
>> I beg your pardon?
>
> Apologies. I should not have said that you have a closed mind about this
> issue.
>
>>
>> UTF-8 is a standard far beyond what
>> Linux uses.
>
> A standard for what? There are three Unicode encodings which are
> generally used: UTF-8, UTF-16, and UTF-32. I cannot understand why you
> think one of them is some sort of standard for something.

Then look at which of those encodings is "dominant" on the Web (HTML pages,
PHP scripts, template files for various CMSes, CSS files, WSDL files, ...),
in various database systems (most of those that adopted wchar_t and UTF-16
(UCS-2 really) had quite a lot of problems because of this, just as Windows
had at some point), or in XML files in general, which are used basically
everywhere.

Just have a look at what encoding the XML files zipped inside
*Microsoft* Office's documents (docx, xlsx, pptx, etc.) use. Hint: no,
it's not UTF-16.

But the most important reason why I think that UTF-8 is superior
to UTF-16/32 is that it is the only truly portable format. Yes, you can
use UTF-16 or UTF-32 as your internal representation of characters
on a certain platform, but if you expect to publish the data and move it
to other computers, then the only rational thing to do is to use UTF-8,
where you don't have to deal with *stupid* byte order marks
or any other similar nonsense.
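
To illustrate the byte ordering issue (taking U+00E9, LATIN SMALL
LETTER E WITH ACUTE, just as an example):

    #include <cstdio>

    int main() {
        // UTF-8 is the same byte sequence on every platform:
        const unsigned char utf8[]    = { 0xC3, 0xA9 };
        // UTF-16 depends on the producer's byte order, which is why
        // a BOM (or some out-of-band agreement) is needed at all:
        const unsigned char utf16le[] = { 0xE9, 0x00 };
        const unsigned char utf16be[] = { 0x00, 0xE9 };
        std::printf("UTF-8:    %02X %02X\n",
                    (unsigned)utf8[0], (unsigned)utf8[1]);
        std::printf("UTF-16LE: %02X %02X\n",
                    (unsigned)utf16le[0], (unsigned)utf16le[1]);
        std::printf("UTF-16BE: %02X %02X\n",
                    (unsigned)utf16be[0], (unsigned)utf16be[1]);
    }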

I think that it is about time that we pick a single character set and
a single encoding, because if we don't, we are back at the point where
we started 30-40 years ago. We just won't have ISO-8859-X, CP-XYZ, ...
but UTF-XY, UCS-N, ... instead.

And actually I think that the usually highly overrated "invisible
hand of the market" has done the right thing this time and has already
picked the "best" encoding for us (see above).
Even Microsoft will wake up one of these days and accept this.
The fact that they already use UTF-8 in their own document formats
is IMO proof of that.

>>
>> All systems use English as the base as
>> it is the best practice.
>
> That's a pretty bald statement but I do not think it is true. But even if it
> were, why should everybody doing something one way be proof that that way is
> best? I would much rather pursue a technical solution that I felt was best
> even if no one else thought so.

I do not think that it is bald (nor bold ;-)). Using the basic character
set and English has one big advantage: you won't have problems
with Unicode normalization. But, for people willing to take the risks
of their code being unportable, etc., I don't see why we could not
add another overload of translate which would accept wchar_t-based
strings, somehow "detect" the encoding, convert it to UTF-8
for the backend (gettext in this case) if necessary, and return
a wide string with the translation.
I don't mind if other people want to risk shooting themselves
in the foot if that is their own free decision :-) This will be temporary
anyway, because UTF-8 literals are already coming.
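
A rough sketch of what I have in mind (a hypothetical wrapper; it
assumes a narrow, UTF-8-based translate() that talks to the gettext
backend, and a utf_to_utf-style converter like the one the library
already ships):

    #include <string>
    #include <boost/locale/encoding_utf.hpp>

    // Assumed to exist: the narrow, UTF-8-based entry point.
    std::string translate(std::string const& utf8_msg);

    // Hypothetical wide overload: convert from the platform's wchar_t
    // encoding (UTF-16 or UTF-32) to UTF-8, let the backend do its
    // job, and convert the result back to a wide string.
    std::wstring translate(std::wstring const& msg)
    {
        using boost::locale::conv::utf_to_utf;
        std::string const utf8_result = translate(utf_to_utf<char>(msg));
        return utf_to_utf<wchar_t>(utf8_result);
    }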

[snip/]

Matus

