From: Gavin Lambert (boost_at_[hidden])
Date: 2019-09-17 06:32:14


On 17/09/2019 16:10, Vicram Rajagopalan wrote:
> I'm not too familiar with dealing with non-ASCII character encodings
> in argv. Is it portable to assume that the input is UTF-8, regardless of
> locale?

It is not.

I'm probably ignorant of several things in this area myself, but the
basic version is:

* On Windows, argv is converted to the current system codepage unless
you are using the wmain/wWinMain entrypoints to get wchar_t strings
instead. (And you should never ever use the converted values, as they
will only sometimes work, due to being a lossy conversion.) The
converted values will not be UTF-8 by default, but you can rely on the
wchar_t strings being UTF-16 (when using wmain/wWinMain).

* On Unixes, argv contains whatever byte sequence the shell/caller put
there. This might be the actual filename on disk (if they used tab
completion) or it might be something subtly different (if they typed it
themselves using some kind of IME), or even a binary blob. In the first
two cases, while it is fairly *likely* to be UTF-8 (especially in modern
systems), it is not guaranteed to be -- the user could be running a
non-UTF-8 locale, or be accessing a filesystem created by someone who
was. Ideally, treat them as an opaque blob that can only be passed to
open() etc and never manipulated as text. (Obviously, this is
frequently impractical.)
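
As a rough illustration of the "opaque blob" approach, here is a minimal
sketch that copies the arguments verbatim in whatever form the platform
delivers them. The names native_char, native_string and collect_args are
made up for this example and do not come from any library; the Windows
branch assumes the arguments come from wmain.

```cpp
#include <string>
#include <vector>

// Hypothetical alias: the character type the OS hands over unconverted.
#ifdef _WIN32
using native_char = wchar_t;   // UTF-16 code units, as delivered to wmain
#else
using native_char = char;      // uninterpreted bytes, as delivered to main
#endif
using native_string = std::basic_string<native_char>;

// Copy the arguments (skipping argv[0]) exactly as given:
// no decoding, no validation, no normalization.
std::vector<native_string> collect_args(int argc,
                                        const native_char* const* argv) {
    std::vector<native_string> args;
    for (int i = 1; i < argc; ++i) {
        args.emplace_back(argv[i]);
    }
    return args;
}
```

Called as collect_args(argc, argv) from main (or wmain on Windows), this
keeps each value byte-for-byte intact, so it can still be handed to
open() or the wide Windows APIs even if it is not valid text in any
encoding.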

So, on Windows, you must use the wchar_t strings as input, and while you
*could* convert them to UTF-8 for internal use you still have to convert
them back to UTF-16 to actually make use of them with the OS. Which is
fine if you're doing a lot of string manipulation (including option
parsing) but seems a bit wasteful if you're only using the value as an
opaque filename token. (And if you forget to convert back to UTF-16, the
OS may interpret your UTF-8 string as a local-codepage-ANSI string, and
hilarity ensues.)
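
To make the UTF-16 to UTF-8 step concrete, here is a portable sketch
using char16_t rather than wchar_t so that it behaves the same on every
platform. The function name utf16_to_utf8 is made up for this example.
On Windows you would more likely call WideCharToMultiByte with CP_UTF8
for this direction and MultiByteToWideChar for the way back; only the
first direction is sketched here. Throwing on an unpaired surrogate is a
choice made for this sketch, not the only reasonable policy.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Encode UTF-16 code units as UTF-8. Throws std::invalid_argument on an
// unpaired surrogate (a policy chosen for this sketch).
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {              // high surrogate
            if (i + 1 >= in.size() ||
                in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF) {
                throw std::invalid_argument("unpaired high surrogate");
            }
            // Combine the pair into one supplementary-plane code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {       // low surrogate
            throw std::invalid_argument("unpaired low surrogate");
        }
        if (cp < 0x80) {                                 // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {                         // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {                       // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                                         // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Note that the result is only useful internally: anything passed back to
the OS as a filename still needs the reverse conversion first.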

Whereas on Linux you can often get away with assuming that it's UTF-8,
but some valid filenames are not valid UTF-8 and will break
encoding-aware code, and any string conversion might output something
that is no longer the same filename.
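
One way to find out whether a given argument is safe to treat as UTF-8
text is to check it before doing anything text-like with it. The
following is a minimal sketch of such a check; the name is_valid_utf8 is
made up for this example, and what to do when it returns false (most
likely: leave the bytes alone and use them only as a filename) is up to
the caller.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8. Rejects truncated sequences,
// overlong forms, encoded surrogates, and code points above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;                     // stray continuation / bad lead
        if (i + len > s.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // out of range
        i += len;
    }
    return true;
}
```

For example, a file created under a Latin-1 locale as "caf\xE9.txt" is a
perfectly valid filename on Linux but fails this check, whereas the same
name stored as "caf\xC3\xA9.txt" passes.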

