Subject: Re: [boost] [strings][unicode] Proposals for Improved String Interoperability in a Unicode World
From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2012-01-28 14:48:08
1. You shall not use any char type other than char and wchar_t for
working with strings. Using the char type and/or char_traits to mark the
encoding doesn't work. This is because the standard-provided facets, C
standard library functions, etc. are provided almost exclusively for char and
wchar_t types. And we *don't want* to specialize all possible facets for
each possible encoding, just as we don't want to add u16sprintf,
u32sprintf, u16cout, u32cout, etc... This would effectively increase the
size of the interface to Θ(number-of-entities × number-of-encodings).
Following the above you won't use char32_t and char16_t added in C++11
either. You will use just one or two encodings internally that will be
those used for char and wchar_t according to the conventions in your code
and/or the platform you work with. The only place you may need the char**_t
types is when converting from UTF-16/UTF-32 into the internal encoding you
use for your strings (either narrow or wide). But in those conversion
algorithms uint_least32_t and uint_least16_t suit your needs just fine.
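To illustrate the last sentence of point 1, here is a minimal sketch of a UTF-16 to UTF-8 conversion that uses only plain integer types for code units and code points. The names utf16_units, append_utf8, utf16_to_utf8 and utf16_example are hypothetical, and replacing unpaired surrogates with U+FFFD is just one assumed error policy; none of this is taken from the proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical alias: UTF-16 code units held in a plain integer type.
typedef std::vector<std::uint_least16_t> utf16_units;

// Append one code point to a UTF-8 narrow string.
// The code point lives in uint_least32_t; char32_t is not needed.
inline void append_utf8(std::string& out, std::uint_least32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert UTF-16 code units to a UTF-8 narrow string.
// Assumed policy: an unpaired surrogate becomes U+FFFD.
inline std::string utf16_to_utf8(const utf16_units& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint_least32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {            // high surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                std::uint_least32_t lo = in[i + 1];
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;                                   // consumed the low surrogate
            } else {
                cp = 0xFFFD;                           // high surrogate without a partner
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;                               // low surrogate without a partner
        }
        append_utf8(out, cp);
    }
    return out;
}

// Small usage example: U+60A8 encodes to three UTF-8 bytes.
inline void utf16_example() {
    utf16_units in;
    in.push_back(0x60A8);
    assert(utf16_to_utf8(in) == "\xE6\x82\xA8");
}
```

The result is an ordinary std::string, so everything downstream (facets, printf, cout) keeps working on char.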
2. "Standard library strings with different character encodings have
different types that do not interoperate." It's good. There shall be no
implicit conversions in user code. If the user wants, she shall specify the
conversion explicitly, as in:
s2 = convert-with-whatever-explicit-interface-you-like("foo");
3. "...class path solves some of the string interoperability
problems..." Class path forces the user to use a specific encoding that she
may not even want to hear of. This manifests in the following ways:
- The 'default' interface returns the encoding used by the system,
requiring the user to use a verbose interface to get the
encoding she uses.
- If the user needs to get the path encoded in her favorite encoding
*by reference* with the lifetime of the path (e.g. as a parameter to an
async call), she must maintain a long-lived *copy* of the temporary
returned from the said interface.
- Getting the extension from a narrow-string path using boost::path
on Windows involves *two* conversions although the system is never called
in the middle.
- Library code can't use path::imbue(). It must pass the
corresponding codecvt facet everywhere to use anything but the
(implementation defined and volatile at runtime) default.
4. "Can be called like this: (example)" So we had 2 encodings to
consider before C++11, 4 after the additions in C++11 and you're proposing
additions to make it easier to work with any number of encodings. We are
moving towards encoding HELL.
5. "A "Hello World" program using a C++11 Unicode string literal
illustrates this frustration:" Unicode string literals (except u8)
illustrate how adding yet another unneeded feature to the C++ standard
complicates the language, adds problems, adds frustration and solves
nothing. The user can just write
cout << u8"您好世界";
Even better is:
cout << "您好世界";
which *just works* on most compilers (e.g. GCC: http://ideone.com/lBpMJ)
and needs some trickery on others (MSVC: save as UTF-8 without BOM). A much
simpler solution is to standardize narrow string literals to be UTF-8
encoded (or a better phrasing would be "capable of storing any Unicode
data" so this will work with UTF-EBCDIC where needed), but I know it's too
much to ask.
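Whether the plain narrow literal really ends up as UTF-8 depends on the compiler and on how the source file is saved, so it can be worth checking. A minimal sketch, with hypothetical function names, that compares the literal against the same greeting spelled out as explicit UTF-8 bytes (the plain literal is used rather than u8 because, since C++20, u8 literals have type char8_t and no longer stream to cout):

```cpp
#include <cassert>
#include <string>

// The greeting as explicit UTF-8 bytes, independent of the source encoding.
inline std::string greeting_utf8_bytes() {
    return "\xE6\x82\xA8\xE5\xA5\xBD\xE4\xB8\x96\xE7\x95\x8C";
}

// True when the compiler stored the plain narrow literal as UTF-8.
// This depends on the source file encoding and the execution character set.
inline bool narrow_literal_is_utf8() {
    return std::string("您好世界") == greeting_utf8_bytes();
}

// Four code points of three UTF-8 bytes each.
inline void literal_example() {
    assert(greeting_utf8_bytes().size() == 12);
}
```

If narrow_literal_is_utf8() returns false, the literal was transcoded somewhere between the editor and the object file, which is the "trickery" case mentioned above.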
6. "String conversion iterators are not provided (minus Example)" This
section *I fully support*. The additions to C++11 pushed by Dinkumware are
heavy, not general enough, and badly designed. C++11 still lacks convenient
conversion between different Unicode encodings, which is a must in today's
world. Just a few notes:
- "Interfaces work at the level of entire strings rather than
characters," This *is* desired since the overhead of the temporary
allocations is repaid by the fact that optimized UTF-8↔UTF-16↔UTF-32
conversions need large chunks of data. Nevertheless I agree that iterator
access is sometimes preferred.
- Instead of the c_str() from "Example" a better approach is to
provide a convenience non-member function that can work on any range of
chars. E.g. using the "char type specifies the encoding" approach this
could look like:
// doesn't even construct an std::string
std::wstring wstr = convert<wchar_t>(u8"您好世界");
// don't care for the name
std::string u8str = convert<char>(wstr);
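A rough sketch of what such a non-member convert could look like, under the assumption that the code unit width picks the encoding (1 byte: UTF-8, 2 bytes: UTF-16, 4 bytes: UTF-32). The namespace sketch, the width tag and the decode/encode helpers are hypothetical names, and validation is deliberately lenient (overlong forms and out-of-range values are not rejected), so treat it as an illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>

namespace sketch {  // hypothetical namespace, not part of Boost or the standard

typedef std::uint_least32_t code_point;
template <std::size_t N> struct width {};  // tag: size of one code unit in bytes

// Decode one code point from UTF-8; malformed input yields U+FFFD.
template <class It> code_point decode(It& it, It end, width<1>) {
    unsigned char b0 = static_cast<unsigned char>(*it++);
    int extra = b0 < 0x80 ? 0 : (b0 >> 5) == 0x06 ? 1
              : (b0 >> 4) == 0x0E ? 2 : (b0 >> 3) == 0x1E ? 3 : -1;
    if (extra < 0) return 0xFFFD;               // invalid lead byte
    code_point cp = extra ? (b0 & (0x3F >> extra)) : b0;
    for (; extra > 0; --extra) {
        if (it == end) return 0xFFFD;           // truncated sequence
        unsigned char b = static_cast<unsigned char>(*it);
        if ((b & 0xC0) != 0x80) return 0xFFFD;  // not a continuation byte
        ++it;
        cp = (cp << 6) | (b & 0x3F);
    }
    return cp;
}
// Decode one code point from UTF-16; unpaired surrogates yield U+FFFD.
template <class It> code_point decode(It& it, It end, width<2>) {
    code_point u = static_cast<code_point>(*it++) & 0xFFFF;
    if (u < 0xD800 || u > 0xDFFF) return u;
    if (u <= 0xDBFF && it != end) {
        code_point lo = static_cast<code_point>(*it) & 0xFFFF;
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            ++it;
            return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    return 0xFFFD;
}
// UTF-32: one code unit is one code point.
template <class It> code_point decode(It& it, It, width<4>) {
    return static_cast<code_point>(*it++);
}

// Encode one code point as UTF-8.
template <class S> void encode(S& out, code_point cp, width<1>) {
    typedef typename S::value_type C;
    if (cp < 0x80) { out += static_cast<C>(cp); return; }
    int extra = cp < 0x800 ? 1 : cp < 0x10000 ? 2 : 3;
    static const unsigned char lead[] = {0, 0xC0, 0xE0, 0xF0};
    out += static_cast<C>(lead[extra] | (cp >> (6 * extra)));
    while (extra-- > 0)
        out += static_cast<C>(0x80 | ((cp >> (6 * extra)) & 0x3F));
}
// Encode one code point as UTF-16, using a surrogate pair above the BMP.
template <class S> void encode(S& out, code_point cp, width<2>) {
    typedef typename S::value_type C;
    if (cp < 0x10000) { out += static_cast<C>(cp); return; }
    cp -= 0x10000;
    out += static_cast<C>(0xD800 + (cp >> 10));
    out += static_cast<C>(0xDC00 + (cp & 0x3FF));
}
template <class S> void encode(S& out, code_point cp, width<4>) {
    out += static_cast<typename S::value_type>(cp);
}

// Iterator-pair form: works on any range of code units.
template <class To, class It>
std::basic_string<To> convert(It first, It last) {
    typedef typename std::iterator_traits<It>::value_type From;
    std::basic_string<To> out;
    while (first != last)
        encode(out, decode(first, last, width<sizeof(From)>()), width<sizeof(To)>());
    return out;
}
// Container form (std::string, std::wstring, std::vector<char>, ...).
template <class To, class Range>
std::basic_string<To> convert(const Range& r) {
    return convert<To>(std::begin(r), std::end(r));
}
// String literal form: drops the terminating NUL, builds no temporary string.
template <class To, class From, std::size_t N>
std::basic_string<To> convert(const From (&lit)[N]) {
    return convert<To>(lit, lit + (lit[N - 1] == From() ? N - 1 : N));
}

}  // namespace sketch

// Usage in the spirit of the two lines above (escapes keep the source ASCII).
inline void convert_example() {
    std::wstring w = sketch::convert<wchar_t>("\xE6\x82\xA8\xE5\xA5\xBD");
    std::string back = sketch::convert<char>(w);
    assert(back == "\xE6\x82\xA8\xE5\xA5\xBD");
}
```

Dispatching on sizeof keeps the interface at one function template instead of one function per encoding pair, at the price of hard-wiring the "char type specifies the encoding" convention.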
7. True interoperability, portability and conciseness will come when
we standardize on *one* encoding.
On Sat, Jan 28, 2012 at 18:46, Beman Dawes <bdawes_at_[hidden]> wrote:
> describes Boost components intended to ease string interoperability in
> general and Unicode string interoperability in particular.
> These proposals are the Boost version of the TR2 proposals made in
> N3336, Adapting Standard Library Strings and I/O to a Unicode World.
> See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3336.html.
> I'm very interested in hearing comments about either the Boost or the
> TR2 proposal. Are these useful additions? Is there a better way to
> achieve the same easy interoperability goals?
> Where is the best home for the Boost proposals? A separate library?
> Part of some existing library?
> Are these proposals orthogonal to the need for deeper Unicode
> functionality, such as Mathias Gaunard's Unicode components?