Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Peter Dimov (pdimov_at_[hidden])
Date: 2011-01-16 12:24:02


Mathias Gaunard wrote:

> POSIX system calls expect the text they receive as char* to be encoded in
> the current character locale.

No. POSIX system calls (on most Unix OSes, Mac OS X being the exception) are
encoding-agnostic: they receive a null-terminated byte sequence (NTBS) and do
not interpret it. On Mac OS X, file paths must be UTF-8. Locales are not
considered.
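
To illustrate the point, here is a minimal sketch, assuming a Linux file
system such as ext4 or tmpfs mounted at /tmp. The helper name
round_trip_raw_name and the temporary directory prefix are made up for this
example. It creates a file whose name is an arbitrary byte string and reads
the directory back, with no locale set anywhere.

```cpp
#include <cstdlib>
#include <string>
#include <vector>
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: creates a file called `name` in a fresh temporary
// directory, lists that directory, cleans up, and returns the names found.
// No locale is consulted; the name is passed to the kernel as raw bytes.
std::vector<std::string> round_trip_raw_name(const std::string& name) {
    std::vector<std::string> names;

    char dir_template[] = "/tmp/ntbs_demo_XXXXXX";  // assumed writable
    char* dir = mkdtemp(dir_template);
    if (!dir) return names;

    const std::string path = std::string(dir) + "/" + name;
    int fd = open(path.c_str(), O_CREAT | O_WRONLY, 0600);
    if (fd >= 0) close(fd);

    if (DIR* d = opendir(dir)) {
        while (dirent* e = readdir(d)) {
            const std::string entry = e->d_name;
            if (entry != "." && entry != "..") names.push_back(entry);
        }
        closedir(d);
    }

    unlink(path.c_str());
    rmdir(dir);
    return names;
}
```

Calling round_trip_raw_name("caf\xE9.txt") passes a name containing the lone
byte 0xE9, which is Latin-1 for an accented e and not valid UTF-8. On an
encoding-agnostic file system the same bytes should come back unchanged; on
Mac OS X the open call would be expected to fail instead.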

> To write cross-platform code, you need to convert your UTF-8 input to the
> locale encoding when calling system calls, and convert text you receive
> from those system calls from the locale encoding to UTF-8.

This is one possible way to do it (blindly using UTF-8 is another). Strictly
speaking, under an encoding-agnostic file system, you must not convert
anything to anything because this may cause you to irretrievably lose the
original path. For display purposes, of course, you have to pick an encoding
somehow. There is no "current" character locale on Unix, by the way, unless
you count the environment variables. The OS itself doesn't care.
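
One way to follow this in code is to keep the raw bytes for every system
call and derive a separate, lossy string only for display. The sketch below
assumes UTF-8 is the encoding picked for display; the function names
utf8_sequence_length and display_name are invented for this example, and
replacing bad bytes with '?' is just one possible policy.

```cpp
#include <cstddef>
#include <string>

// Returns the length (1 to 4) of a well-formed UTF-8 sequence starting at
// s[i], or 0 if the bytes at that position are not valid UTF-8.
std::size_t utf8_sequence_length(const std::string& s, std::size_t i) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    unsigned long cp = 0;      // decoded code point
    unsigned long min_cp = 0;  // smallest code point allowed for this length

    if (b0 < 0x80) return 1;
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
    else return 0;  // stray continuation byte or invalid lead byte

    if (i + len > s.size()) return 0;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) return 0;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min_cp) return 0;                   // overlong encoding
    if (cp > 0x10FFFF) return 0;                 // beyond Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // surrogate
    return len;
}

// Builds a string that is safe to print as UTF-8. Every byte that is not
// part of a valid sequence becomes '?'. The result is for display only:
// system calls must keep using the original `raw` bytes.
std::string display_name(const std::string& raw) {
    std::string out;
    for (std::size_t i = 0; i < raw.size();) {
        const std::size_t n = utf8_sequence_length(raw, i);
        if (n == 0) { out += '?'; ++i; }
        else        { out.append(raw, i, n); i += n; }
    }
    return out;
}
```

With this split, a name such as "caf\xE9.txt" is shown as "caf?.txt" while
the program can still open, rename, or delete the file through the untouched
original bytes.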

Using the current C locale (LANG=...) allows you to display the file names
the same way the 'ls' command does, whereas using UTF-8 allows your user to
enter file names which are not representable in the LANG locale.
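
For the first option, a program can ask which character set the environment
(LANG, LC_ALL, LC_CTYPE) selects. This is a rough sketch using the POSIX
nl_langinfo call; the helper names environment_codeset and is_utf8_codeset
are assumptions made for this example, and the exact codeset strings
returned vary between systems.

```cpp
#include <cctype>
#include <clocale>
#include <string>
#include <langinfo.h>

// Adopts the locale named by the environment variables and returns its
// codeset name, e.g. "UTF-8" or "ISO-8859-1". Note that this changes the
// process-wide C locale as a side effect.
std::string environment_codeset() {
    std::setlocale(LC_ALL, "");
    return nl_langinfo(CODESET);
}

// Loose comparison that accepts spellings such as "UTF-8", "utf8", "Utf-8".
bool is_utf8_codeset(const std::string& codeset) {
    std::string normalized;
    for (char c : codeset) {
        if (c == '-' || c == '_') continue;
        normalized += static_cast<char>(
            std::toupper(static_cast<unsigned char>(c)));
    }
    return normalized == "UTF8";
}
```

If is_utf8_codeset(environment_codeset()) is true, both options coincide:
names printed as UTF-8 look the same as in 'ls'. Otherwise the program has
to choose between matching 'ls' and accepting names the LANG locale cannot
represent.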

> Windows is exactly the same, except it's got two sets of locales and two
> sets of system calls.

Nope. It doesn't have two sets of locales.

> So your technique for writing independent code is relying on the user to
> use an UTF-8 locale?

More or less. The code itself doesn't depend on the user locale and always
works, but to see the actual names in a terminal you need a UTF-8 locale.
This is now the recommended setup on all Unix OSes.


Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk