Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2011-01-16 11:04:11


On 15/01/2011 15:46, Artyom wrote:

> No you don't need convert UTF-8 to "locales" encoding as char* is native
> system API unlike Windows one. So you don't need to mess around with encodings
> at all unless you deal with text related stuff like for example collation.

POSIX system calls expect the text they receive as char* to be encoded
in the current character locale.

To write cross-platform code, you need to convert your UTF-8 input to
the locale encoding before passing it to system calls, and convert the
text those system calls return from the locale encoding back to UTF-8.
(Note: this is exactly what Glib::ustring, the string class used by
gtkmm, does)
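
As a rough illustration of that boundary conversion on a POSIX system,
here is a minimal sketch built on iconv() and nl_langinfo(CODESET). The
helper names convert_encoding, utf8_to_locale and locale_to_utf8 are
made up for this example and do not come from any library. It assumes
the glibc declaration of iconv(), whose input pointer is a char**, and
it assumes the program has already called std::setlocale(LC_ALL, "") so
that nl_langinfo reports the user's locale rather than "C".

```cpp
#include <iconv.h>
#include <langinfo.h>
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: re-encode `in` from encoding `from` to encoding `to`.
std::string convert_encoding(const std::string& in, const char* to, const char* from) {
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    std::string out;
    char buf[256];
    // glibc declares the input as char**; iconv does not modify the bytes.
    char* src = const_cast<char*>(in.data());
    std::size_t src_left = in.size();

    while (src_left > 0) {
        char* dst = buf;
        std::size_t dst_left = sizeof buf;
        std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
        out.append(buf, sizeof buf - dst_left);
        // E2BIG only means the output buffer filled up; anything else is
        // an invalid or unrepresentable input sequence.
        if (rc == (std::size_t)-1 && errno != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("text cannot be converted");
        }
    }

    // Flush any pending shift sequence for stateful target encodings.
    char* dst = buf;
    std::size_t dst_left = sizeof buf;
    iconv(cd, nullptr, nullptr, &dst, &dst_left);
    out.append(buf, sizeof buf - dst_left);

    iconv_close(cd);
    return out;
}

// Internal UTF-8 -> whatever the current locale uses (before a system call).
std::string utf8_to_locale(const std::string& s) {
    return convert_encoding(s, nl_langinfo(CODESET), "UTF-8");
}

// Text returned by a system call -> internal UTF-8.
std::string locale_to_utf8(const std::string& s) {
    return convert_encoding(s, "UTF-8", nl_langinfo(CODESET));
}
```

Note that the conversion to the locale encoding can fail: a UTF-8 string
may contain characters the locale simply cannot represent, which is why
the sketch throws instead of silently dropping them.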

Windows is exactly the same, except it's got two sets of locales and two
sets of system calls.

The wide character locale is more interesting since it is always UTF-16,
so the conversion you have to do is only between UTF-8 and UTF-16, which
is easy and lossless.
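
To show how little is involved, here is a minimal sketch of the UTF-8 to
UTF-16 direction in portable C++11. The function name utf8_to_utf16 is
made up for this example. On Windows the resulting code units would then
be handed to the wide-character system calls, which is an assumption
about usage and is not shown here.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 and re-encode it as UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& in) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (b < 0x80)                { cp = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += len;

        // Reject overlong forms, surrogates and out-of-range values.
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Every valid code point has exactly one representation in each encoding,
so the reverse direction is equally mechanical and nothing is lost in
either direction.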

Likewise, you could also choose to use UTF-16 or UTF-32 as your internal
representation rather than UTF-8. The choice is completely irrelevant
with regard to providing a uniformly encoded interface regardless of
platform.

> The problem is not locales, encodings or other stuff, the problem
> is that Windows API does not allow you to use "char *" based
> string fully as it does not support UTF-8

The actual locale used by the user is irrelevant.

Again, as I said earlier, the fact that UTF-8 is the most common locale
on Linux but is not available on Windows shouldn't affect the way the
system works.

A lot of Linux systems use a Latin-1 locale, and your approach will
simply fail on those systems.
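
To make the failure concrete: under a Latin-1 locale the name "café" is
the four bytes 63 61 66 E9, while its UTF-8 form is the five bytes
63 61 66 C3 A9, so a UTF-8 string passed through unconverted refers to a
different name. The sketch below performs the Latin-1 to UTF-8 mapping
by hand; the function name latin1_to_utf8 is made up for this example,
and a real program would ask the locale for its encoding instead of
hard-coding Latin-1.

```cpp
#include <string>

// Hypothetical helper: each ISO-8859-1 byte is the Unicode code point of
// the same value, so bytes >= 0x80 expand to a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```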

> and platform independent
> programming becomes total mess.

So your technique for writing platform-independent code is relying on
the user to use a UTF-8 locale?

