

Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-01-18 02:50:18

> From: Chad Nelson <chad.thecomfychair_at_[hidden]>
>
> Artyom <artyomtnk_at_[hidden]> wrote:
>
> >> I've done some research, and it looks like it would require little
> >> effort to create an os::string_t type that uses the current locale,
> >> and assume all raw std::strings that contain eight-bit values are
> >> coded in that instead.
> >>
> >> Design-wise, ascii_t would need to change slightly after this, to
> >> throw on anything that can't fit into a *seven*-bit value, rather
> >> than eight-bit. I'll add the default-character option to both types
> >> as well, and maybe make other improvements as I have time.
>
> > Also if you want to use std::codecvt facets...
> > Don't rely on them unless you know where they come from!
> >
> > 1. By default they are no-ops - in the default C locale.
> >
> > 2. Under most compilers they are not implemented properly.
>
> [...]
>
> I was planning to use MultiByteToWideChar and its opposite under
> Windows (which presumably would know how to translate its own code
> pages),

Ok... First of all, I'd suggest taking a look at this code:

What you would see is how painfully hard it is to use these functions
correctly if you want to support things like skipping or replacing
invalid characters. So if you use them, use them with SUPER care, and
don't forget that there are differences between Windows XP and below
and Windows Vista and above - to make your life even more interesting
(a.k.a. miserable).

> and mbsrtowcs and its ilk under POSIX systems (which apparently
> have been well-implemented for at least seven versions under glibc [1],
> though I can't tell whether eglibc -- the fork that Ubuntu uses -- has
> the same level of capabilities).
>
> [1]: <>

No... no... This is not the way to go. For example, what would be the
result of the following?

  #include <string.h>
  #include <wchar.h>

  int main()
  {
      wchar_t wide[32];
      const char *src = "שלום";
      mbstate_t state;
      memset(&state, 0, sizeof state);
      size_t size = mbsrtowcs(wide, &src,
                              sizeof(wide)/sizeof(wide[0]), &state);
      /* size == ??? */
  }

When the current system locale is, say, en_US.UTF-8? The result would
be size == (size_t)(-1), indicating an error.
You first need to call:

  setlocale(LC_ALL, "");

to set up the default locale; only then will mbsrtowcs work.

And how do you think the code below would behave after that call to
setlocale(...)?

  FILE *f = fopen("point.csv", "w");
  fprintf(f, "%f,%f\n", 1.3, 4.5);
  fclose(f);

What would the output be? Would it create a correct CSV?

Answer: it depends on the locale. In some locales, such as ru_RU.UTF-8
or Russian_Russia, the output would be "1,300000,4,500000" rather than
the expected "1.300000,4.500000".

Nice, isn't it?! And believe me, 99.9% of developers would have a hard
time understanding what is wrong with this code. You can't use these
functions!

There is also another problem: what is the "current locale" on the
current OS?

- Is it defined by the OS-global settings of the environment variables
  LC_ALL, LC_CTYPE or LANG?
- Is it defined by LC_ALL, LC_CTYPE or LANG in the current user's
  environment?
- Is it defined by LC_ALL, LC_CTYPE or LANG in the current process
  environment?
- Is it the locale set by setlocale(LC_ALL, "My_Locale_Name.My_Encoding")?
- Is it the locale set by
  std::locale::global(std::locale("My_Locale_Name.My_Encoding"))?

All of these answers are correct, and users would probably expect each
one of them to work. Don't bother trying to detect or convert to the
"current locale" on a POSIX system - it is something that can change
easily, or may not even be defined at all!

> I hadn't wanted to add a dependency on ICU or iconv either. Though I
> may end up having to for the program I'm currently developing, on at
> least some platforms.

Under Unix it is more than justified to use iconv - it is a standard
POSIX API. In fact, on Linux it is part of libc; on some other
platforms (like FreeBSD) it is an independent library.

Actually, Boost.Locale uses iconv by default under Linux, as it is a
better API than ICU's (and faster, because it does not require going
through UTF-16).

> We'll have to agree to disagree there.
> The whole point to these classes was to provide the compiler -- and
> the programmer using them -- with some way for the string to carry
> around information about its encoding, and allow for automatic
> conversions between different encodings.

This is a totally different problem. For that you need a container like
this:

  class specially_encoded_string {
  public:
      std::string encoding() const
      {
          return encoding_;
      }
      std::string to_utf8() const
      {
          return convert(content_, encoding_, "UTF-8");
      }
      void from_utf8(std::string const &input)
      {
          content_ = convert(input, "UTF-8", encoding_);
      }
      std::string const &raw() const
      {
          return content_;
      }
  private:
      std::string encoding_; // <----- VERY IMPORTANT
                             // may have values such as: ASCII, Latin1,
                             // ISO-8859-8, Shift-JIS or Windows-1255
      std::string content_;  // <----- the raw string
  };

Creating an "ascii_t" container, or anything else that does not carry
the REAL encoding name with it, would lead to bad things.

> If you're working with strings in multiple encodings, as I have to in
> one of the programs we're developing, it frees up a lot of mental
> stack space to deal with other issues.

The best way is to convert from the input encoding to an internal one
on input, use that internally, and convert back on output.

I have written several programs that use different encodings:

1. BiDiTeX: LaTeX + BiDi for Hebrew - it converts the input encoding to
   UTF-32 and then converts it back on output.

2. CppCMS: it allows using non-UTF-8 encodings, but the encoding
   information is carried with the std::locale::codecvt facet, and the
   encoding/locale is bound to the current request/response context.
   Every user input (and, BTW, output as well) is validated - for
   example, an HTML form validates the input encoding by default.

These are my solutions to my real problems. What you suggest is
misleading and not well defined.

Best Regards,
Artyom

Boost list run by bdawes at, gregod at, cpdaniel at, john at