Subject: Re: [boost] [locale] Formal review of Boost.Locale library EXTENDED
From: Ryou Ezoe (boostcpp_at_[hidden])
Date: 2011-04-20 19:55:17


On Wed, Apr 20, 2011 at 11:06 PM, Artyom <artyomtnk_at_[hidden]> wrote:
>> I don't understand what you are trying to solve with those so-called solutions.
>>
>> Solution A does not work at all.
>> There is no guarantee that an ordinary string literal is UTF-8 encoded (and it
>> never is in MSVC).
>>
>
> You are right, I missed it.
>
> After digging a little I'm now surprised at how
> broken MSVC's behavior is $%&#%$^&#%^#%^#$%^#$%^#$%^
>
> Seriously...
>
> - Without BOM I can't get L"日本語" as UTF-16 to work when
>  the sources are UTF-8 (however char * is valid utf-8),
>
>  It works only when the sources are in the current
>  locale's encoding that actually may vary from
>  PC to PC so it is totally unreliable.

It doesn't work because there is no way to detect which character
encoding a source file uses.
It can be anything.
See the list of code pages:
http://msdn.microsoft.com/en-us/library/dd317756(v=vs.85).aspx

Assuming the character set from the system's default locale is not
a good idea.
But there is no good way to handle it anyway.

Even std::locale does the same thing, doesn't it?
You set a default global locale and assume it's the correct one.

>
> - With BOM, while I get L"日本語" as UTF-16,
>  I can't get "日本語" as UTF-8 at all; it still
>  formats it in the current locale, which is
>  unreliable as well.

Windows does not truly support a UTF-8 locale (the native encoding is
UTF-16), and MBCS can be anything.
You can't detect it no matter what clever hack you use. It may be UTF-8,
or it may be any encoding in this list other than UTF-16LE and UTF-16BE:
http://msdn.microsoft.com/en-us/library/dd317756(v=vs.85).aspx

So it needs a hint, a BOM.
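
To make the "hint" concrete, here is a minimal sketch of checking a raw
byte buffer for the UTF-8 BOM (the bytes EF BB BF). The function name
has_utf8_bom is made up for illustration. It is not anything MSVC or
Boost.Locale exposes, and it says nothing about how the compiler itself
performs the check.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns true when the buffer starts with the
// UTF-8 byte order mark EF BB BF. Without these three bytes, the
// encoding of the remaining bytes has to be assumed, not detected.
bool has_utf8_bom(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    return size >= 3
        && data[0] == 0xEF
        && data[1] == 0xBB
        && data[2] == 0xBF;
}
```

A buffer without the mark gives no information either way: it may still
be UTF-8, or it may be CP932 or any other code page.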

When Windows says MBCS or ANSI for Japanese, it always means CP932 (the
Microsoft variant of Shift-JIS).
So MSVC, the compiler for Windows, uses that encoding for Japanese.

>
> But you know what? It just convinces me
> even more that ASCII strings should be
> used as keys with all this nonsense
> with MSVC compiler.

No. You must use one of the UCS encodings.
>
>>
>> Solution B... What are you doing?
>> Doesn't wxtranslate(WTR("日本語")) end up as a pointer to const wchar_t that
>> refers to L"日本語"?
>> It does nothing except work as a macro which adds the L encoding prefix.
>> If so, I'd rather write L"日本語" directly.
>>
>
> I see why the second case does not work unless your
> "ids" are in Shift-JIS.
Again, using CP932 (the Microsoft variant) is not a solution.
It's a workaround.

I think I need to explain a bit of the history of Japanese character encodings.
Shift-JIS was designed on top of two JIS standard encodings at a time when
Unicode was not widely used and was not a practical solution.
They needed an encoding for Japanese and they needed it fast.
There were already two JIS standard encodings (kana as an extended ASCII
encoding, and kanji code points). But they were hard to use.
The JIS encoding was a stateful encoding.
It uses escape sequences to change how the rest of the characters
are interpreted.

blah blah blah [escape sequence to change the mode] blah blah blah
[escape sequence to change it back] blah blah blah

The meaning of a code point changes based on the escape sequence
that appears before it.
This is really hard to use.
Designing a new encoding takes time, so they slightly modified
this encoding instead.
In order to remove the escape sequences, they shifted some code points
to squeeze the characters into unused ranges.
Since the shift is just simple arithmetic on the byte values, it's easy
to get the original code point back from Shift-JIS (if you know it is
indeed Shift-JIS).
Thus the name *shift* JIS.
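
The shifting described above can be sketched with the commonly cited
conversion between a two-byte JIS X 0208 code and its Shift-JIS form.
This is only an illustration: the names BytePair, jis_to_sjis and
sjis_to_jis are made up here, and the sketch covers the plain JIS X 0208
range only, not the vendor extensions that CP932 adds.

```cpp
#include <cassert>

// Hypothetical two-byte value; 'lead' is the first byte on the wire.
struct BytePair {
    unsigned char lead;
    unsigned char trail;
};

// JIS X 0208 (both bytes in 0x21..0x7E) -> Shift-JIS.
BytePair jis_to_sjis(unsigned j1, unsigned j2)
{
    assert(j1 >= 0x21 && j1 <= 0x7E && j2 >= 0x21 && j2 <= 0x7E);
    // Two JIS rows share one Shift-JIS lead byte.
    unsigned s1 = ((j1 - 0x21) >> 1) + 0x81;
    if (s1 > 0x9F) s1 += 0x40;      // jump over the single-byte kana range
    unsigned s2;
    if (j1 & 1) {                   // odd row: lower half of trail bytes
        s2 = j2 + 0x1F;
        if (s2 >= 0x7F) ++s2;       // 0x7F is never used as a trail byte
    } else {                        // even row: upper half of trail bytes
        s2 = j2 + 0x7E;
    }
    return BytePair{static_cast<unsigned char>(s1),
                    static_cast<unsigned char>(s2)};
}

// Shift-JIS -> JIS X 0208, the inverse of the function above.
BytePair sjis_to_jis(unsigned s1, unsigned s2)
{
    assert((s1 >= 0x81 && s1 <= 0x9F) || (s1 >= 0xE0 && s1 <= 0xEF));
    assert((s2 >= 0x40 && s2 <= 0x7E) || (s2 >= 0x80 && s2 <= 0xFC));
    if (s1 >= 0xE0) s1 -= 0x40;
    unsigned j1 = ((s1 - 0x81) << 1) + 0x21;
    unsigned j2;
    if (s2 >= 0x9F) {               // upper half: the even row of the pair
        ++j1;
        j2 = s2 - 0x7E;
    } else {                        // lower half: the odd row of the pair
        if (s2 > 0x7F) --s2;
        j2 = s2 - 0x1F;
    }
    return BytePair{static_cast<unsigned char>(j1),
                    static_cast<unsigned char>(j2)};
}
```

For example, 日 is 0x46 0x7C in JIS X 0208, and jis_to_sjis(0x46, 0x7C)
gives 0x93 0xFA. No escape sequence is needed, but a decoder still has to
be told that the bytes are Shift-JIS in the first place.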

EUC-JP has another story, but I don't know EUC-JP well.

Every character encoding has its own history. A long history.
You can't ignore it. These encodings are still widely used.

If you stick with ASCII, expect any of the extended ASCII variants,
because that's what these extended ASCII encodings were meant for.
Give one to code which expects ASCII and, because of its
compatibility with ASCII, it works most of the time.
It's not perfect, but you could use it right away. That's what happened
in the last century.

>
> Anyway, the things you say only add to the total
> mess that exists in current charset encodings.
>
> From all compilers I used (MSVC, GCC, Intel, SunCC, OpenVMS's HP)
> all but MSVC handle UTF-8/wide/narrow characters properly.
What do you mean by properly?
Windows does not support a UTF-8 locale. Everything is handled as UTF-16.

Wide and narrow characters can be anything.
So any encoding is a conforming implementation of the standard.

>
> I'm sorry, I was thinking too well of it.
>
> One way or another, it convinces me that you
> can rely only on ASCII.

If you think of it that way, you probably shouldn't design a localization library.
Don't you think it's odd?
That is, in order to use UCS, we have to use ASCII.
UCS is not perfect. If we were to design it from scratch (including
dropping compatibility with ASCII), we could design it better.
Nobody would use such a standard, though.
UCS is the only standard whose encodings (UTF-8, UTF-16, UTF-32) can
represent all well-known glyphs in the world today.

You prefer UTF-8 because, fundamentally, you're thinking in ASCII.
UTF-8 is ASCII compatible. ASCII is NOT UTF-8 compatible.
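
The asymmetry can be shown with a small sketch of a UTF-8 encoder plus
an ASCII check. The names encode_utf8 and is_ascii are made up for
illustration and are not part of Boost.Locale.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode scalar value as UTF-8 bytes.
std::string encode_utf8(char32_t cp)
{
    // Surrogates and values above U+10FFFF are not scalar values.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                // ASCII range: the byte is unchanged
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// True when every byte is in the 7-bit ASCII range.
bool is_ascii(const std::string& bytes)
{
    for (unsigned char b : bytes) {
        if (b >= 0x80) return false;
    }
    return true;
}
```

Every ASCII string passes through encode_utf8 byte for byte, so it is
already valid UTF-8. The reverse does not hold: 日 (U+65E5) becomes the
three bytes E6 97 A5, none of which is ASCII.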

>
> Artyom

-- 
Ryou Ezoe