

Subject: Re: [boost] [strings][unicode] Proposals for Improved String Interoperability in a Unicode World
From: Artyom Beilis (artyomtnk_at_[hidden])
Date: 2012-01-29 10:14:57


----- Original Message -----
> From: Mathias Gaunard <mathias.gaunard_at_[hidden]>
>> No, MSVC does not allow creating both "שלום" and L"שלום" literals
>> as Unicode (UTF-8, UTF-16); for all other compilers it is the default
>> behavior.
>
> And it shouldn't.

It depends on the point of view. (see below)

> String literals are in the execution character set. On Windows the execution
> character set is what it calls ANSI. That much is not going to change.

The execution character set is host dependent, and the ANSI code page differs
from one host to another. When you compile the program on one host with one
character set, the program will not behave correctly on another host.

This is a huge bug in the C++ design, and that is why it should be fixed. Most
compilers around have already done this... It can be done in a backward
compatible way by requiring a compile-time option and deprecating the concept
of an "execution character set".

>>>> 1. BOM should not be used in source code, no compiler except MSVC uses it
>>>>    and most do not support it.
>>>
>>> According to Yakov, GCC supports it now.
>>> It would be nice if it could work without any BOM though.
>>
>> GCC's default input and literal encoding is UTF-8. BOM is not needed.
>
> That's not what I'm saying. What we want is a unified way to set UTF-8
> as the source character set. The problem is that MSVC requires BOM, but GCC
> used to not allow it.

The problem is not BOM or no BOM; BOM is not the way to fix the problem. The
whole concept of using a "BOM" to distinguish between the ANSI encoding and
UTF-8 exists only on Windows. It is not portable, and most importantly it is
a silly thing to require a "Byte-Order-Mark" for UTF-8, which does not have a
byte order. GCC provides a flag to specify the encoding, and AFAIR most other
compilers do the same.

>>>> 2. Setting a UTF-8 BOM makes narrow literals be encoded in the ANSI
>>>>    encoding, which makes BOM with MSVC even more useless (crap... sorry).
>>>
>>> That's the correct behaviour.
>>
>> No, it is unspecified behavior according to the standard.
>
> It isn't.

It is, because the host character set is not well defined and varies from host
to host, so the result is simply not specified. I'll make it more clear:
**it is not well defined**.

>> Standard does not specify what narrow encoding should be used, that
>> is why u8"" was created.
>
> The standard specifies that it is the execution character set. MSVC specifies
> that for its implementation, the execution character set is ANSI.

So maybe the standard should add an option to specify the input character set
explicitly, so that it would not vary from host to host?

>> All (but MSVC) compilers create UTF-8 literals and use UTF-8 input
>> and this is the default.
>
> That's because for those other compilers, you are in a case where the source
> character set is the same as the execution character set.

GCC allows specifying both the "" literal encoding and the input encoding:
the -finput-charset and -fexec-charset options.

> With MSVC, if you don't do anything, both your source and execution
> character sets are ANSI. If you set your source character set to UTF-8, your
> execution character set remains ANSI still.

No, it will remain the original ANSI encoding, which may not match the host's
ANSI encoding:

  Input CP-XXX "test"  ->  literal CP-XXX "test"   <- execution charset

But at runtime it would be CP-YYY != CP-XXX.

> On non-Windows platforms, UTF-8 is the most common execution character set, so
> you can have a setup where source = execution = UTF-8, but you can't do that
> on Windows.
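To make the CP-XXX vs CP-YYY point concrete, here is a small untested sketch
(assuming GCC and an iconv that knows the Windows code page names) that dumps
the bytes the compiler actually baked into a narrow literal:

  // Compile e.g. with:
  //   g++ -finput-charset=UTF-8 -fexec-charset=CP1255 dump.cpp
  // and the literal is re-encoded from the UTF-8 source to CP1255 (Windows
  // Hebrew) at compile time; with -fexec-charset=UTF-8 you get UTF-8 bytes.
  // The code page of the host that later runs the program (CP-YYY) plays
  // no role at all in what gets printed.
  #include <cstdio>

  int main()
  {
      const char *s = "שלום"; // narrow literal, stored in the exec charset
      for (const char *p = s; *p; ++p)
          std::printf("%02X ", static_cast<unsigned char>(*p));
      std::printf("\n");
  }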
> But that is irrelevant to the standard.

As I told you, the standard should specify a way to define both the execution
and the input character set.

>>> Use u8 string literals if you want UTF-8.
>>
>> Why on earth should I do this?
>
> Because it makes perfect sense and it's the way it's supposed to work.

Except that it does not solve any real problem.

>> All the world around uses UTF-8. Why should I specify u8"" if it is
>> something that can be easily defined at compiler level?
>
> Because otherwise you're not independent from the execution character set.
> Writing your program with Unicode allows you to not depend on platform-specific
> encodings, that doesn't mean it makes them go away.

I remind you that UTF-8 is Unicode...

> I repeat, narrow string literals are and will remain in the execution character
> set. Expecting those to end up as UTF-8 data is wrong and not portable.

I think it is a bug in the design, and the programmer should be able to
override it. Finally, the "execution" character set is meaningless as it is
host dependent; the "narrow-literal" character set is meaningful.

>> All we need is some flag for MSVC that tells that string
>> literals encoding is UTF-8.
>
> That "flag" is using the u8 prefix on those string literals.
> Remember: the encoding used for the data in a string literal is independent from
> the encoding used to write the source.

I know.

>> AFAIR, neither gcc4.6 nor msvc10 supports u8"".
>
> Unicode string literals have been in GCC since 4.5.

AFAIR GCC supports u"" and U""; when I checked, u8"" was not working, but I
may be wrong.

Artyom Beilis
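P.S. For reference, a quick sketch of the literal forms under discussion
(assuming a compiler that already implements all the C++11 prefixes; as said
above, I am not sure current GCC/MSVC accept u8"" yet):

  const char     ansi[]  =   "שלום"; // execution charset, whatever that happens to be
  const char     utf8[]  = u8"שלום"; // always UTF-8, independent of the exec charset
  const char16_t utf16[] =  u"שלום"; // UTF-16 code units
  const char32_t utf32[] =  U"שלום"; // UTF-32 code points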

