Subject: Re: [boost] [filesystem]Extracting path as string from wpath
From: Beman Dawes (bdawes_at_[hidden])
Date: 2008-10-20 08:43:29


On Sun, Oct 19, 2008 at 5:17 AM, Ulrich Eckhardt <doomster_at_[hidden]> wrote:

> On Friday 17 October 2008 17:45:28 Emil Dotchevski wrote:
> > "In Mac OS X's VFS API file names are, by definition, canonically
> > decomposed Unicode, encoded using UTF-8."
> >
> > This means that precomposed characters are forbidden and combining
> > diacritics must be used to replace them.
> >
> > See http://developer.apple.com/qa/qa2001/qa1173.html.
>
> Danger: read the whole document! The point is that nothing guarantees this
> encoding; it is by no means enforced by the OS. So, in order to be able to
> use non-compliant media (e.g. ones with codepage encodings, possibly
> even unknown codepage encodings) you have to treat the strings received
> from the filesystem as byte strings. The only things you can rely on are:
> - Termination with a null byte.
> - Segments separated with a path separator (i.e. '/').
>
> Otherwise, converting it to a text string is a lossy conversion because of
> the unreliable encoding (though assuming UTF-8 as a default works).
> Similarly, encoding to a byte string isn't reliable, because the encoding
> of the filesystem isn't guaranteed.
>
> BTW:
> - A similar discussion took place on the Python developers' mailing list.
> The current state seems to be to implement both a Unicode API and one using
> byte strings in parallel, though I'm not advocating that approach.
> - The same problem is present on all POSIX systems (BSDs, Linux, ...), though
> there you don't have the UTF-8 default but rather the encoding of the
> LC_CTYPE locale.

Yes. The situation on POSIX systems is quite messy. I've been discussing it
with the POSIX folks, and get conflicting answers depending on the example
presented. Part of the problem is that the documented behavior of the POSIX
command line utilities is different from the program API behavior. Also,
real-world behavior sometimes seems different from POSIX specifications.
Sigh.

I'd really like to be put in contact with someone who has access to and is
familiar with POSIX variants used in Asia.
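
To make the byte-string approach from the quoted message concrete, here is a
minimal sketch, assuming a POSIX-style path whose only reliable structure is
the '/' separator. The helper name split_path_bytes is hypothetical and is not
part of Boost.Filesystem; it only illustrates handling the name without
assuming any encoding.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not a Boost.Filesystem API): split a raw path into
// segments using only the '/' byte as structure. The bytes inside each
// segment are copied verbatim and never decoded, so names in an unknown
// or invalid encoding survive unchanged.
std::vector<std::string> split_path_bytes(const std::string& raw)
{
    std::vector<std::string> segments;
    std::string::size_type start = 0;
    while (start <= raw.size()) {
        std::string::size_type end = raw.find('/', start);
        if (end == std::string::npos)
            end = raw.size();
        if (end > start)  // skip empty segments from leading or doubled '/'
            segments.push_back(raw.substr(start, end - start));
        start = end + 1;
    }
    return segments;
}
```

For example, a directory name stored as Latin-1 "caf\xE9" is not valid UTF-8,
yet the sketch passes it through untouched because it never interprets the
bytes as text.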

--Beman

