Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-01-18 06:22:43


> From: Chad Nelson <chad.thecomfychair_at_[hidden]>
> Artyom <artyomtnk_at_[hidden]> wrote:
>
> >> I've done some research, and it looks like it would require little
> >> effort to create an os::string_t type that uses the current locale,
> >> and assume all raw std::strings that contain eight-bit values are
> >> coded in that instead.
> >>
> >> Design-wise, ascii_t would need to change slightly after this, to
> >> throw on anything that can't fit into a *seven*-bit value, rather
> >> than eight-bit. I'll add the default-character option to both types
> >> as well, and maybe make other improvements as I have time.
> > Also, if you want to use std::codecvt facets...
> > Don't rely on them unless you know where they come from!
> >
> > 1. By default they are no-ops - in the default C locale
> >
> > 2. Under most compilers they are not implemented properly. [...]
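
To illustrate point 1 above, here is a minimal sketch, assuming a standard
C++ library and a program that has not yet called std::locale::global().
The helper names are made up for this example. It only shows that the
global locale starts out as "C" and that the char-to-char facet does no
conversion; what the wchar_t facet does in that locale is
implementation-defined and is not tested here.

```cpp
#include <cwchar>
#include <locale>
#include <string>

// Name of the current global C++ locale. Until someone calls
// std::locale::global(), this is the classic "C" locale.
// (Hypothetical helper, for illustration only.)
std::string global_locale_name() {
    return std::locale().name();
}

// True if the char <-> char codecvt facet of `loc` never converts anything.
// (Hypothetical helper, for illustration only.)
bool char_codecvt_is_noop(const std::locale& loc) {
    typedef std::codecvt<char, char, std::mbstate_t> facet_type;
    return std::use_facet<facet_type>(loc).always_noconv();
}
```

So a program that never sets a locale gets no encoding conversion from
these facets by default.
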
>
> I was planning to use MultiByteToWideChar and its opposite under
> Windows (which presumably would know how to translate its own code
> pages),

Ok...

First of all, I'd suggest taking a look at this code:

http://cppcms.svn.sourceforge.net/viewvc/cppcms/boost_locale/trunk/libs/locale/src/encoding/wconv_codepage.hpp?revision=1462&view=markup

What you will see is how painfully hard it is to use these functions
correctly if you want to support things like skipping or replacing
invalid characters.

So if you use them, use them with SUPER care, and don't forget that
their behavior differs between Windows XP and earlier and Windows Vista
and later - to make your life even more interesting (a.k.a. miserable)
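
To make "skipping or replacing invalid characters" concrete, here is a
rough portable sketch of the three policies a converter usually has to
offer. It deliberately does not call MultiByteToWideChar: it is a small
hand-written UTF-8 decoder, and the names invalid_policy and
utf8_to_utf32 are hypothetical (they are not taken from Boost.Locale or
from the file linked above). It resynchronises one byte at a time, which
is only one of several reasonable choices.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical policy names, for illustration only.
enum invalid_policy { skip_invalid, replace_invalid, throw_on_invalid };

// Decode UTF-8 into UTF-32 code points, applying `policy` to malformed input.
std::u32string utf8_to_utf32(const std::string& in, invalid_policy policy) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;  // expected sequence length, 0 = bad lead byte
        char32_t cp = 0;      // code point being assembled
        char32_t min = 0;     // smallest value allowed for this length
        if (lead < 0x80)                { len = 1; cp = lead;        min = 0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;  // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (ok) {
            out.push_back(cp);
            i += len;
            continue;
        }
        if (policy == throw_on_invalid)
            throw std::runtime_error("invalid UTF-8 sequence");
        if (policy == replace_invalid)
            out.push_back(0xFFFD);  // U+FFFD REPLACEMENT CHARACTER
        ++i;  // skip the offending byte and try to resynchronise
    }
    return out;
}
```

Even this toy version needs care with truncated input, overlong forms and
surrogates, and that is before any platform API gets involved.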

> and mbsrtowcs and its ilk under POSIX systems (which apparently
> have been well-implemented for at least seven versions under glibc [1],
> though I can't tell whether eglibc -- the fork that Ubuntu uses -- has
> the same level of capabilities).
>
> [1]: <http://www.cl.cam.ac.uk/~mgk25/unicode.html>
>

This is the code that converts between encodings and usesds

> > Bottom line: don't rely on the "current locale" :-)
>
> I hadn't wanted to add a dependency on ICU or iconv either. Though I may
> end up having to for the program I'm currently developing, on at least
> some platforms.
>
> > [...] I would strongly recommend to read the answer of Pavel
> > Radzivilovsky on Stackoverflow:
> >
> > http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful/1855375#1855375
> >
> > And he is a hard-core Windows programmer, designer, architect and
> > developer, and still he chose UTF-8!
>
> Thanks, I'm familiar with it. In fact, reading that was one of the
> reasons that I started developing the utf*_t classes, so that I *could*
> keep strings in UTF-8 while still keeping track of the ones that aren't.
>
> > The problem is that the issue is so complicated that there is only one
> > way to make it both absolutely general and correct: decide what you are
> > working with and stick with it.
> >
> > In the CppCMS project I work on (and I developed Boost.Locale because
> > of it) I stick with UTF-8 by default and use plain std::string -
> > works like a charm.
>
> To each his own. :-)
>
> > Inventing "special Unicode strings or storage" improves neither
> > anybody's understanding of Unicode nor its handling.
>
> We'll have to agree to disagree there. The whole point to these classes
> was to provide the compiler -- and the programmer using them -- with
> some way for the string to carry around information about its encoding,
> and allow for automatic conversions between different encodings. If
> you're working with strings in multiple encodings, as I have to in one
> of the programs we're developing, it frees up a lot of mental stack
> space to deal with other issues.
> --
> Chad Nelson
> Oak Circle Software, Inc.