From: Rainer Deyke (rainerd_at_[hidden])
Date: 2019-09-17 11:27:24


On 17.09.19 08:32, Gavin Lambert via Boost wrote:
> * On Unixes, argv contains whatever byte sequence the shell/caller put
> there.  This might be the actual filename on disk (if they used tab
> completion) or it might be something subtly different (if they typed it
> themselves using some kind of IME), or even a binary blob.  In the first
> two cases, while it is fairly *likely* to be UTF-8 (especially in modern
> systems), it is not guaranteed to be -- the user could be running a
> non-UTF-8 locale, or be accessing a filesystem created by someone who
> was.

Or the user could be running a non-UTF-8 locale while accessing a
filesystem created by somebody who was using UTF-8, in which case the
filenames are in UTF-8 and should be read as such, even if the user's
locale disagrees.

It is because of this last possibility that I recommend treating all
command-line arguments as UTF-8 on Unix systems, even if running a
non-UTF-8 locale, for all cases where treating them as binary blobs is
impractical. Unix filenames are binary blobs, but the de facto standard
for interpreting these binary blobs as text is to use UTF-8. How can
two users, running two different locales, share a filesystem? By using
UTF-8 for all filenames, regardless of locale. How should a program
convert command-line arguments into UTF-8 filenames? By assuming that
they are already in UTF-8, because performing any kind of conversion
will cause more problems than it will fix.
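
To make the recommendation concrete, here is a minimal C++ sketch of
"assume UTF-8, convert nothing". It is only an illustration: the helper
names is_valid_utf8 and arg_to_utf8_filename are hypothetical and not
part of Boost or of any proposal in this thread. The argument bytes are
copied unchanged and the locale is never consulted; the validity check
is meant only for places that must treat the name as text (for example
when displaying it), not as a gate on opening the file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: true if `s` is well-formed UTF-8. Rejects stray
// or missing continuation bytes, overlong forms, surrogates, and code
// points above U+10FFFF.
inline bool is_valid_utf8(std::string_view s) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;                       // invalid lead byte
        if (i + len > s.size()) return false;    // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[len]) return false;               // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        if (cp > 0x10FFFF) return false;                  // out of range
        i += len;
    }
    return true;
}

// Hypothetical helper: take a command-line argument as a UTF-8 filename.
// Assumption (per the argument above): the bytes are already UTF-8, so
// they are copied as-is with no locale-dependent conversion.
inline std::string arg_to_utf8_filename(const char* arg) {
    assert(arg != nullptr);
    return std::string(arg);
}
```

A program would call arg_to_utf8_filename(argv[i]) and hand the result
straight to the filesystem API. If is_valid_utf8 returns false, the
bytes can still be used to open the file; only text-level operations
need a fallback such as escaping the offending bytes.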

-- 
Rainer Deyke (rainerd_at_[hidden])

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk