
Subject: Re: [boost] [locale] Formal review of Boost.Locale library EXTENDED
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-04-21 00:58:59


> From: Ryou Ezoe <boostcpp_at_[hidden]>
>
> On Wed, Apr 20, 2011 at 11:06 PM, Artyom <artyomtnk_at_[hidden]> wrote:
> >> I don't understand what you are trying to solve by those so-called
> >> solutions.
> >>
> >> Solution A does not work at all.
> >> There is no guarantee an ordinary string literal is UTF-8 encoded
> >> (and in MSVC it never is).
> >>
> >
> > You are right, I missed it.
> >
> > After digging a little I am now surprised how
> > broken MSVC's behavior is $%&#%$^&#%^#%^#$%^#$%^#$%^
> >
> > Seriously...
> >
> > - Without a BOM I can't get L"日本語" as UTF-16 to work when
> >   the sources are UTF-8 (although the char * version is valid UTF-8).
> >
> >   It works only when the sources are in the current
> >   locale's encoding, which may actually vary from
> >   PC to PC, so it is totally unreliable.
>
> It doesn't work because there is no way to detect what character
> encoding the source file uses.
> It can be anything.
> See the list:
> http://msdn.microsoft.com/en-us/library/dd317756(v=vs.85).aspx

Exactly; that means the same source code compiled on two different
machines can lead to two different results.

> Assuming the character set according to the system's default locale
> is not a good idea.
> But there is no good way to handle it anyway.

Actually, there is #pragma setlocale:

http://msdn.microsoft.com/en-us/library/3e22ty2t(v=VS.100).aspx

But too few people know about it, and it is MSVC specific.

> Even std::locale does the same thing, doesn't it?
> Set a default global locale and assume it's the correct locale.

This is different. For example, when you start a text editor on my PC,
where the system/user locale is he_IL.UTF-8, it is more than reasonable
to start the user interface in Hebrew, while on your PC (I assume
Japanese_Japan.932) it should have a Japanese interface.

However, unlike on your Windows PC, where the narrow encoding differs
per locale, on my Linux PC it is always UTF-8, so no encoding change is
required.

That was one of the goals of this library: by default it chooses UTF-8
as the narrow encoding (unless it is told to select the so-called ANSI
one).

So it allows you, with the help of std::locale-aware tools like
boost::filesystem, to develop cross-platform Unicode-aware software.

(And don't suggest using wide strings, as they are quite useless for
cross-platform development; they are useful on Windows, which has the
wide API as its major API, but no more than that.)

> > - With a BOM, while I do get L"日本語" as UTF-16...
> >   I can't get "日本語" as UTF-8 at all; it is still
> >   encoded in the current locale, which is
> >   unreliable as well.
>
> Windows does not truly support a UTF-8 locale. (The native encoding
> is UTF-16.)
> Because MBCS can be anything.
> You can't detect it no matter what clever hack you use.

See #pragma setlocale above.

> Maybe UTF-8,
> maybe one of the encodings in the list, excluding UTF-16LE and BE:
> http://msdn.microsoft.com/en-us/library/dd317756(v=vs.85).aspx
>
> So it needs a hint, a BOM.

Yes; unfortunately you have to put in a BOM that all the other
compilers in the world complain about.

I'm aware of the way Windows thinks about encodings. I just mistakenly
assumed that with a BOM, "שלום" and L"שלום" would get me UTF-8 and
UTF-16 strings under MSVC, but I was wrong.

> > But you know what? It just convinces me
> > even more that ASCII strings should be
> > used as keys, given all this nonsense
> > in the MSVC compiler.
>
> No. You must use one of the UCS encodings.

But MSVC does not know how to handle UCS encodings :-)

I mean that you can't get both a UTF-8 string and a UTF-16 string under
MSVC from the same sources.
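To make the problem easy to reproduce, here is a small test program of
my own (a sketch, not something from this thread): it dumps the bytes
the compiler actually stored for a narrow and a wide literal. Saved as
UTF-8, GCC on Linux prints the 8-byte UTF-8 sequence (d7 a9 d7 9c d7 95
d7 9d) for the narrow literal, while a BOM-marked file under a
2011-era MSVC yields whatever the local ANSI code page produces (4
bytes of Windows-1255 on a Hebrew system, question marks elsewhere):

    #include <cstdio>
    #include <cstring>
    #include <cwchar>

    int main()
    {
        const char    *narrow = "שלום"; // execution encoding: compiler's choice
        const wchar_t *wide   = L"שלום";

        // Dump the narrow literal byte by byte.
        std::printf("narrow, %u bytes:", (unsigned)std::strlen(narrow));
        for (const char *p = narrow; *p; ++p)
            std::printf(" %02x", (unsigned)(unsigned char)*p);
        std::printf("\n");

        // Dump the wide literal unit by unit
        // (UTF-16 on Windows, UTF-32 on most Unixes).
        std::printf("wide, %u units:", (unsigned)std::wcslen(wide));
        for (const wchar_t *p = wide; *p; ++p)
            std::printf(" %04x", (unsigned)*p);
        std::printf("\n");
        return 0;
    }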
> >> Isn't wxtranslate(WTR("日本語")) just ending up as a pointer to
> >> const wchar_t that refers to L"日本語"?
> >> It does nothing except work as a macro that adds the L encoding
> >> prefix.
> >> If so, I'd rather write L"日本語" directly.
> >>
> >
> > I see why the second case does not work unless your
> > "ids" are in Shift-JIS.
>
> Again, using CP932 (the Microsoft variant) is not a solution.
> It's a workaround.

Agreed, but using Japanese as the source-string encoding is a
workaround for a programmer's lack of English knowledge. Do you really
expect that French, German, Hebrew, Arabic or Greek developers would
all use their own languages in the sources?

> I think I need to explain a bit of the history of Japanese character
> encodings.
> Shift-JIS was designed based on two JIS standard encodings when
> Unicode was not widely used and not a practical solution.
> [snip]

I know this story, and I mostly say Shift-JIS as it is clearer than
saying cp932...

> Every character encoding has its own history. A long history.
> You can't ignore it. These encodings are still widely used.
>
> If you stick with ASCII, expect any of the extended ASCII variants.

When I say ASCII I should probably say US-ASCII, not the extended
ones. The extended ASCII variants should vanish in favor of UTF-8,
including ISO-8859-8 and Windows-1255 (the Hebrew encodings) and other
encodings like Latin-1 and JIS.

> >
> > Anyway, the things you say only increase the total
> > mess that exists in current charset encodings.
> >
> > Of all the compilers I have used (MSVC, GCC, Intel, SunCC,
> > OpenVMS's HP), all but MSVC handle UTF-8/wide/narrow characters
> > properly.
>
> What do you mean, properly?
> Windows does not support a UTF-8 locale. Everything is handled as
> UTF-16.
>
> Wide and narrow characters can be anything.
> So any encoding is a proper implementation of the standard.

I mean that "שלום" and L"שלום" would be UTF-8 and UTF-16 or 32, as in
all the other C and C++ compilers on Earth.

> >
> > I'm sorry, I thought too well of it.
> >
> > One way or the other, it convinces me that you
> > can rely only on ASCII.
>
> If you think of it that way, you probably shouldn't design a
> localization library.
> Don't you think it's odd?

I don't want to open a philosophical discussion about the design
decisions Microsoft made, but some of them are really bad ones, and
that is why Boost.Locale uses the UTF-8 encoding by default under MS
Windows (unless you explicitly tell it to select the local ANSI
encoding).

This library tries to do its best to support both wide and narrow
characters, but at some point decisions have to be made in one
direction or the other, because otherwise we would stay behind.

I understand that you are a Windows developer who is familiar with the
wide character API. However, Boost.Locale is a cross-platform system,
and it can't be based on the wide API, because that API is useless
outside Microsoft Windows.

So yes, it chooses to stick with portable decisions like using char *
as the string id and selecting UTF-8 as the default encoding, because
otherwise it would not move us forward.

Also, given the fact that "char *" is a total mess in terms of
encoding, and "wchar_t *" is even messier in terms of source file
encodings, I think the design decision to stick with "char *" and
US-ASCII ids is the right one.
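To show what that design looks like in use, here is a minimal sketch
along the lines of Boost.Locale's documented API (the "demo" catalog
domain and the messages path are my hypothetical placeholders): the
message id stays US-ASCII, so its bytes are identical under every
compiler, and the translated Japanese text arrives from a catalog at
run time as UTF-8.

    #include <boost/locale.hpp>
    #include <iostream>

    int main()
    {
        // Build a locale whose narrow encoding is UTF-8 (Boost.Locale's
        // default, on Windows too) and install it globally.
        boost::locale::generator gen;
        gen.add_messages_path(".");      // hypothetical catalog location
        gen.add_messages_domain("demo"); // hypothetical catalog name
        std::locale::global(gen("ja_JP.UTF-8"));
        std::cout.imbue(std::locale());

        // The id is plain US-ASCII; the Japanese text is looked up in
        // the catalog at run time, so source encoding never matters.
        std::cout << boost::locale::translate("Hello World") << std::endl;
        return 0;
    }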
> That is, in order to use UCS, we have to use ASCII.
> UCS is not perfect. If we were to design it from scratch (including
> dropping compatibility with ASCII), we could design it better.
> Nobody would use such a standard, though.
> UCS is the only standard whose encodings (whether UTF-8, UTF-16 or
> UTF-32) can represent all well-known glyphs in the world today.
>
> You prefer UTF-8 because, fundamentally, you're thinking in ASCII.
> UTF-8 is ASCII compatible. ASCII is NOT UTF-8 compatible.

You are probably confusing two things:

a) Every US-ASCII string is a UTF-8 string.
b) Only a subset of UTF-8 strings is US-ASCII.

Also, I don't think in ASCII; I think in portability, and UTF-8 is much
more portable than UTF-16 or UTF-32.
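To make (a) and (b) concrete, a toy check of my own (not part of
Boost.Locale): US-ASCII uses only the byte range 0x00-0x7F, and UTF-8
encodes exactly those code points as the same single byte, so every
US-ASCII string is automatically valid UTF-8 while the converse fails.

    #include <cassert>
    #include <string>

    // Returns true iff every byte is in the US-ASCII range 0x00-0x7F.
    static bool is_us_ascii(const std::string &s)
    {
        for (std::string::size_type i = 0; i < s.size(); ++i)
            if (static_cast<unsigned char>(s[i]) > 0x7F)
                return false;
        return true;
    }

    int main()
    {
        // (a) an ASCII id is automatically valid UTF-8 as well
        assert(is_us_ascii("Hello World"));
        // (b) the UTF-8 bytes of "שלום" are valid UTF-8 but not US-ASCII
        assert(!is_us_ascii("\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d"));
        return 0;
    }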
Best,
Artyom
