Subject: Re: [boost] [filesystem]Extracting path as string from wpath
From: Ulrich Eckhardt (doomster_at_[hidden])
Date: 2008-10-19 05:17:19
On Friday 17 October 2008 17:45:28 Emil Dotchevski wrote:
> "In Mac OS X's VFS API file names are, by definition, canonically
> decomposed Unicode, encoded using UTF-8."
> This means that precomposed characters are forbidden and combining
> diacritics must be used to replace them.
> See http://developer.apple.com/qa/qa2001/qa1173.html.
Danger: read the whole document! The point is that nothing guarantees this
encoding; it is by no means enforced by the OS. So, in order to be able to
use non-compliant media (e.g. ones with codepage encodings, possibly even
unknown ones), you have to treat the strings received from the filesystem
as byte strings. The only things you can rely on are:
- Termination with a null byte.
- Segments separated with a path separator (i.e. '/').
Beyond that, converting such a byte string to a text string is a lossy
conversion because of the unreliable encoding (though assuming UTF-8 as a
default works). Similarly, encoding text to a byte string isn't reliable,
because the encoding of the filesystem isn't guaranteed.
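
To illustrate the two guarantees and the lossy-conversion point, here is a
minimal sketch in plain C++. The names split_segments and is_well_formed_utf8
are made up for this example and are not part of Boost.Filesystem. The first
function relies only on '/' as separator and never interprets the bytes; the
second checks whether a segment happens to be well-formed UTF-8 before anyone
tries to decode it as text.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split a raw path at '/' without interpreting the
// bytes in any way. Empty segments (leading, trailing or doubled
// separators) are dropped.
std::vector<std::string> split_segments(const std::string& raw_path)
{
    std::vector<std::string> segments;
    std::string::size_type start = 0;
    while (start <= raw_path.size()) {
        std::string::size_type end = raw_path.find('/', start);
        if (end == std::string::npos)
            end = raw_path.size();
        if (end > start)
            segments.push_back(raw_path.substr(start, end - start));
        start = end + 1;
    }
    return segments;
}

// Hypothetical helper: true if the bytes form well-formed UTF-8. Rejects
// truncated sequences, stray continuation bytes, overlong forms,
// surrogates and values above U+10FFFF.
bool is_well_formed_utf8(const std::string& bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) {          // plain ASCII byte
            ++i;
            continue;
        }
        std::size_t trailing = 0;   // number of continuation bytes expected
        unsigned long cp = 0;       // decoded code point
        unsigned long min_cp = 0;   // smallest value allowed for this length
        if ((lead & 0xE0) == 0xC0)      { trailing = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { trailing = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { trailing = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;          // stray continuation or invalid lead byte
        if (n - i <= trailing)
            return false;           // sequence is cut off at the end
        for (std::size_t k = 1; k <= trailing; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;       // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;           // overlong, out of range or surrogate
        i += trailing + 1;
    }
    return true;
}
```

A name like "caf\xE9" (Latin-1) survives split_segments unchanged but fails
is_well_formed_utf8, so any conversion to text would have to guess or lose
information. Note that passing the check only means the bytes could be UTF-8,
not that they are, and it says nothing about composed vs. decomposed form.
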
- A similar discussion took place on the Python developers' mailing list.
The current state seems to be to implement both a Unicode API and one using
byte strings in parallel, though I'm not advocating that approach.
- The same problem is present on all POSIX systems (BSDs, Linux, ...), though
there you don't have the UTF-8 default but rather the encoding of the CTYPE
- On modern MS Windows platforms, the system actually claims to guarantee
UTF-16. Non-decodable media are supposedly simply rejected, but I can't say
for sure that this works.
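
For the POSIX case, a program can at least ask which encoding the current
locale claims. The following is a small sketch, assuming that "CTYPE" above
refers to the LC_CTYPE locale category and that the POSIX header
<langinfo.h> is available; ctype_codeset is a name made up for this example.

```cpp
#include <clocale>
#include <string>
#include <langinfo.h>   // POSIX only, not available on MS Windows

// Hypothetical helper: return the codeset name of the LC_CTYPE locale
// taken from the environment, e.g. "UTF-8" or "ISO-8859-1".
// Note: this changes the process-wide LC_CTYPE setting as a side effect.
std::string ctype_codeset()
{
    std::setlocale(LC_CTYPE, "");               // adopt the user's setting
    const char* name = nl_langinfo(CODESET);    // encoding of that locale
    return name ? std::string(name) : std::string();
}
```

The result only describes what the user's environment claims. It is not a
property of the filesystem, so names written under a different locale may
still fail to decode.
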