Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Chad Nelson (chad.thecomfychair_at_[hidden])
Date: 2011-01-18 10:34:21

On Mon, 17 Jan 2011 23:50:18 -0800 (PST)
Artyom <artyomtnk_at_[hidden]> wrote:

>>> Also if you want to use std::codecvt facet...
>>> Don't rely on them unless you know where they come from!
>>> 1. By default they are noop - in the default C locale
>>> 2. Under most compilers they are not implemented properly. [...]
>> I was planning to use MultiByteToWideChar and its opposite under
>> Windows (which presumably would know how to translate its own code
>> pages),
> Ok...
> First of all, I'd suggest taking a look at this code:

Pretty convoluted.

> What you would see is how painfully hard it is to use these functions right
> if you want to support things like skipping or replacing invalid
> characters.

Sorry for the cheap shot, but: it's Microsoft. I *expect* it to be
painful to use, from long experience. ;-)

> So if you use it, use it with SUPER care, and don't forget that
> there are changes between Windows XP and below and Windows Vista
> and above - to make your life even more interesting (a.k.a. miserable)

As you might have seen in an earlier reply this morning, I didn't
realize that it wasn't irretrievably tied to ICU; now that I know, I'd
be completely happy letting Boost.Locale handle the code-page stuff.

>> We'll have to agree to disagree there. The whole point to these
>> classes was to provide the compiler -- and the programmer using
>> them -- with some way for the string to carry around information
>> about its encoding, and allow for automatic conversions between
>> different encodings.
> This is a totally different problem. If so, you need a container like this:
> class specially_encoded_string {
> private:
>     std::string encoding_; /// <----- VERY IMPORTANT
>                            /// may have values such as: ASCII, Latin1,
>                            /// ISO-8859-8, Shift-JIS or Windows-1255
>     std::string content_;  /// <----- The raw string
> };

If you want arbitrary encodings, yes. If you only want a subset of the
possible encodings -- such as ASCII and the three main UTF types --
then all you need is some way to convert to and from an OS-specific

> Creating an "ascii_t" container or anything that does
> not carry the REAL encoding name with it would lead to bad things.

Certainly, if you tried to use it for stuff that isn't really in that
encoding. It wasn't meant for that.
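
A minimal sketch of the idea, assuming only UTF-8 and UTF-16 are needed. The
names encoded_string, utf8_tag, utf16_tag and to_utf16 are hypothetical and
are not taken from the proposed classes or from Boost. The encoding lives in
the type, so the compiler rejects mixing two encodings by accident, and the
conversion is an explicit function call:

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical encoding tags; illustrative names only.
struct utf8_tag {};
struct utf16_tag {};

// A string whose encoding is part of its type rather than a runtime label.
template <typename Tag, typename Char>
class encoded_string {
public:
    encoded_string() = default;
    explicit encoded_string(std::basic_string<Char> raw) : raw_(std::move(raw)) {}
    const std::basic_string<Char>& raw() const { return raw_; }
private:
    std::basic_string<Char> raw_;
};

using utf8_string  = encoded_string<utf8_tag, char>;
using utf16_string = encoded_string<utf16_tag, char16_t>;

// UTF-8 to UTF-16. This is not a validating decoder: continuation bytes are
// not checked. Stray lead bytes and truncated sequences become U+FFFD.
utf16_string to_utf16(const utf8_string& in) {
    const std::string& s = in.raw();
    std::u16string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (b < 0x80)              { cp = b;        len = 1; }
        else if ((b >> 5) == 0x06) { cp = b & 0x1F; len = 2; }
        else if ((b >> 4) == 0x0E) { cp = b & 0x0F; len = 3; }
        else if ((b >> 3) == 0x1E) { cp = b & 0x07; len = 4; }
        else { out.push_back(0xFFFD); ++i; continue; }             // stray byte
        if (i + len > s.size()) { out.push_back(0xFFFD); break; }  // truncated
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP need a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return utf16_string(std::move(out));
}
```

Passing a utf8_string where a utf16_string is expected fails to compile,
which is the kind of bookkeeping the type is supposed to take off the
programmer's hands.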

>> If you're working with strings in multiple encodings, as I have to
>> in one of the programs we're developing, it frees up a lot of mental
>> stack space to deal with other issues.
> The best way is to convert on input to an internal encoding, use
> it, and convert it back at output.

I agree, for the most part. But if a large something comes in encoded
with a particular coding, why waste a possibly-significant amount of
processor time immediately recoding it to your internal format if you
don't know that you're going to need to do anything with it? Or if it
might well just be going out in that same external format again,
without needing to be touched? Much better to hold onto it in whatever
format it comes in, and recode it only when you need to, in my
opinion -- if you can easily keep track of what format it's in, anyway.
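
A rough sketch of that deferred recoding, assuming just two encodings. The
names lazy_text, encoding and latin1_to_utf8 are hypothetical, and the
conversion counter exists only to make the behaviour visible. The bytes are
kept exactly as they arrived, and a conversion happens only when a different
encoding is actually requested:

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical subset of encodings, enough for the sketch.
enum class encoding { latin1, utf8 };

// Latin-1 to UTF-8: every Latin-1 byte is its own code point (U+0000..U+00FF).
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Holds text in whatever encoding it arrived in and recodes on demand.
class lazy_text {
public:
    lazy_text(std::string bytes, encoding enc)
        : bytes_(std::move(bytes)), enc_(enc) {}

    // Return the text in the requested encoding. When the requested encoding
    // matches the stored one, the stored bytes are returned with no recoding.
    std::string as(encoding target) const {
        if (target == enc_) return bytes_;
        if (enc_ == encoding::latin1 && target == encoding::utf8) {
            ++conversions_;
            return latin1_to_utf8(bytes_);
        }
        throw std::invalid_argument("conversion not implemented in this sketch");
    }

    int conversions() const { return conversions_; }

private:
    std::string bytes_;
    encoding enc_;
    mutable int conversions_ = 0;  // bookkeeping for the demonstration only
};
```

Data that comes in as Latin-1 and goes out as Latin-1 never touches the
converter; only a request for UTF-8 pays for the recoding.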

> [...] 2. CppCMS: it allows using non-UTF-8 encodings, but the encoding
> information is carried with the std::locale::codecvt facet I created,
> and the encoding/locale is bound to the current request/response
> context. [...]

That sounds an awful lot like having a new string type that carries
around its encoding. ;-)

> These are my solutions of my real problems.
> What you suggest is misleading and not well defined.

I can see that parts of it are certainly not well defined yet, but I
believe it's a fixable problem.

Chad Nelson
Oak Circle Software, Inc.
