Subject: Re: [boost] [locale] Formal review of Boost.Locale library EXTENDED
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-04-21 00:58:59
> From: Ryou Ezoe <boostcpp_at_[hidden]>
> On Wed, Apr 20, 2011 at 11:06 PM, Artyom <artyomtnk_at_[hidden]> wrote:
> >> I don't understand what you are trying to solve by that; the so-called
> >> Solution A does not work at all.
> >> There is no guarantee an ordinary string literal is UTF-8 encoded (and it
> >> never is in MSVC).
> > You are right, I missed it.
> > After digging a little I'm now surprised how
> > broken MSVC's behavior is $%&#%$^&#%^#%^#$%^#$%^#$%^
> > Seriously...
> > - Without a BOM I can't get L"日本語" as UTF-16 to work when
> > the sources are UTF-8 (although the char * literal is valid UTF-8).
> > It works only when the sources are in the current
> > locale's encoding, which may actually vary from
> > PC to PC, so it is totally unreliable.
> It doesn't work because there is no way to detect what character
> encoding the source file uses.
> It can be anything.
> See the list.
Exactly; that means that the same source code compiled on two different
machines can lead to two different results.
> Assuming the character set according to the system's default locale is not
> a good idea.
> But there is no good way to handle it anyway.
Actually, there is #pragma setlocale,
but too few know about it, and it is MSVC specific.
> Even std::locale does the same thing, doesn't it?
> Set a default global locale and assume it's the correct locale.
This is different. For example, when you start a text editor
on my PC, where the system/user locale is he_IL.UTF-8, it
is more than reasonable to start the user interface
in Hebrew, while on your PC (I assume Japanese_Japan.932)
it should have a Japanese interface.
However, unlike on your Windows PC, the narrow encoding
on my Linux PC is UTF-8,
so no encoding change is required.
That was one of the goals of this library: by
default it chooses UTF-8 as the narrow encoding
(unless it is told to select the so-called ANSI encoding),
so together with std::locale-aware
tools like boost::filesystem it allows you to
develop cross-platform Unicode-aware software.
(And don't suggest using wide strings, as they are
quite useless for cross-platform development;
they are useful on Windows, whose major API is
the wide API, but no more than that.)
> > - With a BOM, while I get L"日本語" as UTF-16...
> > I can't get "日本語" as UTF-8 at all; it still
> > formats them in the current locale, which is
> > unreliable as well.
> Windows does not truly support a UTF-8 locale (the native encoding is UTF-16),
> because MBCS can be anything.
> You can't detect it no matter what clever hack you use.
See #pragma setlocale above.
> Maybe UTF-8,
> maybe one of the encodings in the list, excluding UTF-16LE and BE.
> So it needs a hint, a BOM.
Yes, unfortunately you have to put a BOM, which all compilers
in the world complain about.
I'm aware of the way Windows thinks of encodings.
I just mistakenly assumed that with a BOM "שלום" and L"שלום"
would get me UTF-8 and UTF-16 strings under MSVC, but I was wrong.
> > But you know what? It just convinces me
> > even more that ASCII strings should be
> > used as keys with all this nonsense
> > with MSVC compiler.
> No. You must use one of UCS encodings.
But MSVC does not know how to handle UCS encodings :-)
I mean you can't get both a UTF-8 string and a UTF-16 string
under MSVC in the same sources.
> >> Solution B... What are you doing?
> >> Doesn't wxtranslate(WTR("日本語")) end up as a pointer to const wchar_t that
> >> refers to L"日本語"?
> >> It does nothing except work as a macro which adds the L encoding prefix.
> >> If so, I'd rather write L"日本語" directly.
> > I see why the second case does not work unless your
> > "ids" are in Shift-JIS.
> Again, using CP932 (the Microsoft variant) is not a solution.
> It's a workaround.
Agreed, but using Japanese as the source string encoding is
a workaround for the programmer's lack of English knowledge.
Do you really expect that French, German, Hebrew,
Arabic or Greek developers would all use their
own languages in the sources?
> I think I need to explain a bit of history about Japanese character encoding.
> Shift-JIS was designed based on two JIS standard encodings when
> Unicode was not widely used and not a practical solution.
I know this story, and I mostly say Shift-JIS because it is clearer
than saying cp932...
> Every character encoding has its own history. A long history.
> You can't ignore it. These encodings are still widely used.
> If you stick with ASCII, expect any of the extended ASCII variants.
When I say ASCII I should probably say US-ASCII, not the extended
variants. Extended ASCII variants should vanish, including
ISO-8859-8 and Windows-1255 (Hebrew encodings) and other
encodings like Latin-1, JIS and others, in favor of UTF-8.
> > Anyway, the things you say only increase the total
> > mess that exists in current charset encodings.
> > Of all the compilers I have used (MSVC, GCC, Intel, SunCC, OpenVMS's HP),
> > all but MSVC handle UTF-8/wide/narrow characters properly.
> What do you mean properly?
> Windows does not support a UTF-8 locale. Everything is handled as UTF-16.
> Wide and narrow characters can be anything.
> So any encoding is a proper implementation of the standard.
I mean "שלום" and L"שלום" would be UTF-8 and UTF-16 or 32,
as in all C and C++ compilers on Earth.
> > I'm sorry, I was thinking too well of it.
> > One way or another, it convinces me that you
> > can rely only on ASCII.
> If you think of it that way, you probably shouldn't design a localization library.
> Don't you think it's odd?
I don't want to open a philosophical discussion
about what design decisions Microsoft made,
but some of them are really bad ones, and
that is why Boost.Locale uses the UTF-8 encoding
by default under MS Windows (unless you
explicitly tell it to select the local ANSI encoding).
This library tries to do its best supporting
both wide and narrow characters, but at
some points decisions have to be made
in one direction or the other, because otherwise
we would fall behind.
I understand that you are a Windows developer
who is familiar with wide characters.
However, Boost.Locale is a cross-platform system,
and it can't be based on the wide API because
that API is useless outside Microsoft Windows.
So yes, it chooses to stick to portable
decisions like using char * as string ids
and selecting UTF-8 as the default encoding,
because otherwise it would not get us anywhere.
Also, given the fact that "char *" is a total
mess in terms of encoding, and
"wchar_t *" is even messier in
terms of source file encodings,
I think the design decision to stick
to "char *" and to US-ASCII
for ids is the right one.
> That is, in order to use UCS, we have to use ASCII.
> UCS is not perfect. If we were to design it from scratch (including
> dropping compatibility with ASCII), we could design it better.
> Nobody would use such a standard, though.
> UCS is the only standard whose encodings (UTF-8, UTF-16, UTF-32) can
> represent all well-known glyphs in the world today.
> You prefer UTF-8 because, fundamentally, you're thinking in ASCII.
> UTF-8 is ASCII compatible. ASCII is NOT UTF-8 compatible.
You are probably confusing two things:
a) Every US-ASCII string is a UTF-8 string.
b) Only a subset of UTF-8 strings is US-ASCII.
Also, I don't think in ASCII, I think in portability,
and UTF-8 is much more portable than UTF-16.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk