From: dylan_nicholson (dylan_nicholson_at_[hidden])
Date: 2002-03-03 21:36:52


--- In boost_at_y..., Jan Langer <jan_at_l...> wrote:
> hello
> the filesystem library will need a mechanism for converting different
> char types into other char types. the same problem also occurs with
> basic_string. i think this is a quite general case and it is worth a
> general solution. an other example is a program using wstring and
> wanting to print to a normal char-stream.
> a reasonable way of solving this is to get a ctype facet and to apply
> narrow or widen to the character.
>
However, you can't assume that will do the correct mapping.
One example (which I've given before) is Win9x: even though long
filenames are stored in UTF-16 format inside the FAT file system, the
only access to these names is via Windows-only proprietary MBCS
encodings. To convert to and from those encodings you must use
MultiByteToWideChar(CP_ACP, ...) and WideCharToMultiByte(CP_ACP, ...).

The same is essentially true under NT, except that of course NT *can*
handle both Unicode and MBCS filenames internally, so there really
should be no need for library code to do any conversions.
In fact MS do provide a "Unicode" layer for Win9x that does these
conversions too, and it would not be (IMHO) unreasonable to simply
require that if you *wish* to use std::wstring to hold filenames and
you want to support Win9x, then you must use MS's supplied library (as
far as I understand it, you simply download it, link it into your
application, and redistribute the DLL with your application; it has
some magic to continue working correctly under NT). That way, at least
for the Win32 implementation, *no* wstring <-> string conversions
should be needed.

For POSIX, however, assuming you go the ctype narrow/widen approach,
the main issue is of course which locale to request. I would say
locale("") (i.e. the default "system" locale), but there probably
needs to be a once-off method of overriding this.
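
To make the POSIX suggestion concrete, here is a rough sketch of what I
have in mind, not a proposed interface. All of the names
(conversion_locale, set_conversion_locale, widen_with_ctype,
narrow_with_ctype) are made up for illustration. It assumes that
falling back to the classic "C" locale is acceptable when locale("")
cannot be constructed, and that '?' is an acceptable replacement for
characters that cannot be narrowed.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical: try the user's "system" locale, fall back to "C" if the
// environment names a locale the implementation cannot construct.
inline std::locale system_locale_or_classic()
{
    try { return std::locale(""); }
    catch (const std::runtime_error&) { return std::locale::classic(); }
}

// Hypothetical: the locale used for all conversions, initialised once.
inline std::locale& conversion_locale()
{
    static std::locale loc = system_locale_or_classic();
    return loc;
}

// Hypothetical: the "once-off" override.
inline void set_conversion_locale(const std::locale& loc)
{
    conversion_locale() = loc;
}

// Widen each char independently via the ctype<wchar_t> facet.
inline std::wstring widen_with_ctype(const std::string& in)
{
    const std::ctype<wchar_t>& ct =
        std::use_facet<std::ctype<wchar_t> >(conversion_locale());
    std::wstring out(in.size(), L'\0');
    if (!in.empty())
        ct.widen(in.data(), in.data() + in.size(), &out[0]);
    return out;
}

// Narrow each wchar_t independently; characters the facet cannot
// represent become `fallback`.
inline std::string narrow_with_ctype(const std::wstring& in,
                                     char fallback = '?')
{
    const std::ctype<wchar_t>& ct =
        std::use_facet<std::ctype<wchar_t> >(conversion_locale());
    std::string out(in.size(), '\0');
    if (!in.empty())
        ct.narrow(in.data(), in.data() + in.size(), fallback, &out[0]);
    return out;
}
```

Note that narrow and widen work one character at a time, so this can
never produce or consume a multibyte sequence, which is exactly why it
cannot be assumed to give the correct mapping for MBCS filenames.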

Does the latest Mac interface have any Unicode support?

One thing that might be generically useful is UTF-8 <-> UTF-16 <->
UTF-32 conversion. It's not much use for filesystem support, seeing as
very few filesystems use these standards (fair enough... they didn't
exist until a few years ago), but it's extremely useful for
internet-based protocols. The problem is deciding whether you are
using wstring as UTF-16 or UTF-32. Some people on c.l.c++.m have
claimed that UTF-16 wouldn't be allowed because wstring isn't supposed
to allow any multi-char characters, but in fact even UTF-32 uses
multi-char "combining sequences" (esp. for diacritics), so this
argument doesn't hold with me*. On the other hand UTF-32 is patently
excessively expensive for the vast majority of languages, and probably
even for the majority of cases in languages that really do need 4
billion different "characters".
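
As a rough illustration of the conversions I mean, the sketch below
encodes a single UTF-32 code point as UTF-16 and as UTF-8. The function
names (append_utf16, append_utf8) are made up, and the sketch assumes a
compiler that provides char16_t, char32_t and std::u16string rather
than committing wstring to either encoding. It only rejects surrogates
and values above 0x10FFFF; decoding and full validation are left out.

```cpp
#include <string>

// True for values that may be encoded: not a surrogate, not above U+10FFFF.
inline bool is_scalar_value(char32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical helper: append one code point to a UTF-16 string.
// Code points above U+FFFF take two units (a surrogate pair).
inline bool append_utf16(char32_t cp, std::u16string& out)
{
    if (!is_scalar_value(cp)) return false;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
        return true;
    }
    cp -= 0x10000;  // leaves 20 bits, split 10/10
    out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
    out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    return true;
}

// Hypothetical helper: append one code point to a UTF-8 string
// (one to four bytes depending on the value).
inline bool append_utf8(char32_t cp, std::string& out)
{
    if (!is_scalar_value(cp)) return false;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

Even in UTF-32, a user-visible character such as "e" followed by
U+0301 (combining acute accent) is still two elements, so one element
per "character" does not hold in any of the three forms.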

Dylan

* See http://www.unicode.org/unicode/faq/char_combmark.html#7

