Subject: Re: [boost] [filesystem] Mac OS default codecvt facet
From: Peter Dimov (pdimov_at_[hidden])
Date: 2010-02-14 20:22:06
Beman Dawes wrote:
> On Sun, Feb 14, 2010 at 3:53 PM, Peter Dimov <pdimov_at_[hidden]>
> wrote:
>> Beman Dawes wrote:
>>
>>> * Is UTF-8 OK with Mac OS users as the Boost.Filesystem default?
>>
>> UTF-8 is not merely a default on Mac OS X. It's _the_ encoding used
>> by the OS.
>
> Do you have a link for that?
The most authoritative one is probably
"All BSD system functions expect their string parameters to be in UTF-8
encoding and nothing else. Code that calls BSD system routines should ensure
that the contents of all const *char parameters are in canonical UTF-8
encoding. In a canonical UTF-8 string, all decomposable characters are
decomposed; for example, é (0x00E9) is represented as e (0x0065) + ´
(0x0301). To put things into a canonical UTF-8 encoding, use the
"file-system representation" interfaces defined in Cocoa and Carbon
(including Core Foundation)."
I think that in practice the OS will accept any valid UTF-8 and normalize it
internally, so callers don't need to decompose it themselves.
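To make the quoted é example concrete, here is a minimal sketch that encodes the precomposed and decomposed forms as UTF-8 so the bytes can be compared. The helper names (encode_utf8, utf8_from_code_points, to_hex) are hypothetical and not part of any Apple or Boost API, and the sketch does not normalize anything itself; it only shows what the two spellings look like at the byte level.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: append the UTF-8 encoding of one code point.
void encode_utf8(std::uint32_t cp, std::string& out)
{
    // Only Unicode scalar values are encodable (no surrogates, max U+10FFFF).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: encode a sequence of code points as one UTF-8 string.
std::string utf8_from_code_points(const std::vector<std::uint32_t>& cps)
{
    std::string out;
    for (std::uint32_t cp : cps) encode_utf8(cp, out);
    return out;
}

// Hypothetical helper: render bytes as space-separated uppercase hex.
std::string to_hex(const std::string& s)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s) {
        if (!out.empty()) out += ' ';
        out += digits[c >> 4];
        out += digits[c & 0x0F];
    }
    return out;
}
```

With these helpers, to_hex(utf8_from_code_points({0x00E9})) gives "C3 A9" for the precomposed form, and to_hex(utf8_from_code_points({0x0065, 0x0301})) gives "65 CC 81" for the decomposed form. The two byte strings differ, which is why the question of who does the normalization matters for file name comparisons.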
http://lists.apple.com/archives/unix-porting/2007/Sep/msg00023.html
"The kernel will reject any filename that is not a valid UTF-8 string, and
it will even be normalized (to Unicode NFD) before stored on disk, at least
when using HFS. The right way to deal with it would be to always convert the
filename to UTF-8 before trying to open/create a file."
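If the kernel rejects invalid names, a caller may want to check a name up front. Below is a rough sketch of a well-formedness check following the usual UTF-8 rules (no overlong forms, no surrogates, nothing above U+10FFFF). The function name is_valid_utf8 is hypothetical, and whether the kernel applies exactly these rules is an assumption on my part, not something the quoted post states.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: true if s is well-formed UTF-8.
bool is_valid_utf8(const std::string& s)
{
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;        // total bytes in this sequence
        std::uint32_t cp;       // code point being decoded
        std::uint32_t min;      // smallest code point allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return false;            // stray continuation or invalid lead byte

        if (n - i < len) return false;  // sequence truncated at end of string

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min) return false;                      // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += len;
    }
    return true;
}
```

For example, a Latin-1 encoded "café" (the byte 0xE9 on its own) fails this check, while both the precomposed and decomposed UTF-8 spellings pass. The check says nothing about normalization form; it only tells you whether the bytes are UTF-8 at all.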
http://lists.apple.com/archives/applescript-users/2002/Sep/msg00319.html
"How a file name looks at the API level depends on the API. Current Carbon
APIs handle file names as an array of UTF-16 characters; POSIX ones handle
them as an array of UTF-8, which is why UTF-8 works well in Terminal. How
it's stored on disk depends on the disk format; HFS+ uses UTF-16, but that's
not important in most cases."
http://developer.apple.com/mac/library/qa/qa2001/qa1173.html
"In Mac OS X's VFS API file names are, by definition, canonically decomposed
Unicode, encoded using UTF-8. This raises a number of interesting issues."