Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Chad Nelson (chad.thecomfychair_at_[hidden])
Date: 2011-01-17 18:30:23


On Mon, 17 Jan 2011 10:09:13 -0800 (PST)
Artyom <artyomtnk_at_[hidden]> wrote:

>> I've done some research, and it looks like it would require little
>> effort to create an os::string_t type that uses the current locale,
>> and assume all raw std::strings that contain eight-bit values are
>> coded in that instead.
>>
>> Design-wise, ascii_t would need to change slightly after this, to
>> throw on anything that can't fit into a *seven*-bit value, rather
>> than eight-bit. I'll add the default-character option to both types
>> as well, and maybe make other improvements as I have time.
>
> Unfortunately this is not the correct approach either.
>
> For example, why do you think it is safe to pass the ASCII subset of
> UTF-8 to the current non-UTF-8 locale?
>
> For example, Shift-JIS, which is in use with the Windows/ANSI API, has
> a different character set in the 0-127 range - it is not ASCII!

Ah, I wasn't aware that there were character sets that redefined
0..127. That does change things a bit.

> Also, if you want to use the std::codecvt facets...
> Don't rely on them unless you know where they come from!
>
> 1. By default they are no-ops - in the default C locale
>
> 2. Under most compilers they are not implemented properly. [...]

I was planning to use MultiByteToWideChar and its opposite under
Windows (which presumably would know how to translate its own code
pages), and mbsrtowcs and its ilk under POSIX systems (which apparently
have been well-implemented for at least seven versions under glibc [1],
though I can't tell whether eglibc -- the fork that Ubuntu uses -- has
the same level of capabilities).

[1]: <http://www.cl.cam.ac.uk/~mgk25/unicode.html>
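
For the POSIX side, here is a rough sketch of the kind of mbsrtowcs
call I have in mind. The helper name locale_to_wide is made up for
illustration and is not part of the library. It assumes the caller has
already selected a locale with setlocale, and it stops at the first
embedded NUL, so treat it as a starting point only.

```cpp
#include <clocale>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of any library): decode a string in
// the current C locale's multibyte encoding into a wide string.
std::wstring locale_to_wide(const std::string& in)
{
    std::mbstate_t state = std::mbstate_t();
    const char* src = in.c_str();

    // First pass: a null destination only reports the required length.
    std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence for current locale");

    std::wstring out(len, L'\0');
    src = in.c_str();
    state = std::mbstate_t();

    // Second pass: convert exactly len wide characters into the buffer.
    if (len > 0)
        std::mbsrtowcs(&out[0], &src, len, &state);
    return out;
}
```

The Windows branch would call MultiByteToWideChar in the same
two-pass style (ask for the size, then convert), but I've left it out
here.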

> Bottom line: don't rely on the "current locale" :-)

I hadn't wanted to add a dependency on ICU or iconv either. Though I may
end up having to for the program I'm currently developing, on at least
some platforms.

> [...] I would strongly recommend reading Pavel Radzivilovsky's
> answer on Stack Overflow:
>
> http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful/1855375#1855375
>
> And he is a hard-core Windows programmer, designer, architect and
> developer, and still he chose UTF-8!

Thanks, I'm familiar with it. In fact, reading that was one of the
reasons that I started developing the utf*_t classes, so that I *could*
keep strings in UTF-8 while still keeping track of the ones that aren't.

> The problem is that the issue is so complicated that there is only
> one way to make it both absolutely general and right: decide what
> you are working with and stick with it.
>
> In the CppCMS project I work on (and I developed Boost.Locale because
> of it) I stick with UTF-8 by default and use plain std::string -
> works like a charm.

To each his own. :-)

> Inventing "special unicode strings or storage" does not improve
> anybody's understanding of Unicode, nor does it improve its handling.

We'll have to agree to disagree there. The whole point of these classes
was to provide the compiler -- and the programmer using them -- with
some way for the string to carry around information about its encoding,
and allow for automatic conversions between different encodings. If
you're working with strings in multiple encodings, as I have to in one
of the programs we're developing, it frees up a lot of mental stack
space to deal with other issues.
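
To make that concrete, here is a stripped-down sketch of the idea. The
types in namespace sketch are simplified stand-ins I've written for
this message, not the actual ascii_t and utf*_t classes, and Latin-1
is used only because its conversion to UTF-8 is short enough to show
in full.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

namespace sketch {

// Hypothetical stand-in: holds only seven-bit values, throws otherwise.
class ascii_t {
public:
    explicit ascii_t(const std::string& s) : data_(s) {
        for (std::string::size_type i = 0; i < s.size(); ++i)
            if (static_cast<unsigned char>(s[i]) > 0x7F)
                throw std::invalid_argument("ascii_t: value does not fit in seven bits");
    }
    const std::string& raw() const { return data_; }
private:
    std::string data_;
};

// Hypothetical stand-in: bytes are taken to be ISO-8859-1 code points.
class latin1_t {
public:
    explicit latin1_t(const std::string& s) : data_(s) {}
    const std::string& raw() const { return data_; }
private:
    std::string data_;
};

// Hypothetical stand-in: converts implicitly from the other two types.
class utf8_t {
public:
    // ASCII is a subset of UTF-8, so the bytes copy through unchanged.
    utf8_t(const ascii_t& a) : data_(a.raw()) {}

    // Latin-1 values 0x80..0xFF become two-byte UTF-8 sequences.
    utf8_t(const latin1_t& l) {
        const std::string& s = l.raw();
        for (std::string::size_type i = 0; i < s.size(); ++i) {
            const unsigned char c = static_cast<unsigned char>(s[i]);
            if (c < 0x80) {
                data_ += static_cast<char>(c);
            } else {
                data_ += static_cast<char>(0xC0 | (c >> 6));
                data_ += static_cast<char>(0x80 | (c & 0x3F));
            }
        }
    }
    const std::string& raw() const { return data_; }
private:
    std::string data_;
};

// A function that wants UTF-8 says so in its signature; callers can
// pass an ascii_t or latin1_t and the conversion happens for them.
inline std::size_t utf8_byte_length(const utf8_t& s) { return s.raw().size(); }

} // namespace sketch
```

With something like this, sketch::utf8_byte_length(sketch::latin1_t(...))
compiles and converts on the way in, while passing a raw std::string
does not compile, so the encoding question has to be answered at the
point where the string enters the program.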

-- 
Chad Nelson
Oak Circle Software, Inc.