Subject: Re: [boost] [strings][unicode] Proposals for Improved String Interoperability in a Unicode World
From: Beman Dawes (bdawes_at_[hidden])
Date: 2012-01-29 10:52:28
On Sat, Jan 28, 2012 at 2:48 PM, Yakov Galka <ybungalobill_at_[hidden]> wrote:
> My opinion:
> 1. You shall not use any char type other than char and wchar_t for
> working with strings. Using the char type and/or char_traits to mark the
> encoding doesn't work. This is because the standard provided facets, C
> standard library functions etc. are provided almost only for char and
> wchar_t types. And we *don't want* to specialize all possible facets for
> each possible encoding, just as we don't want to add u16sprintf,
> u32sprintf, u16cout, u32cout, etc... This would effectively increase the
> size of the interface to Θ(number-of-entities × number-of-encodings).
> Following the above you won't use char32_t and char16_t added in C++11
> either. You will use just one or two encodings internally that will be
> those used for char and wchar_t according to the conventions in your code
> and/or the platform you work with. The only place you may need the char**_t
> types is when converting from UTF-16/UTF-32 into the internal encoding you
> use for your strings (either narrow or wide). But in those conversion
> algorithms uint_least32_t and uint_least16_t suit your needs just fine.
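The conversion step described in the last sentence can indeed be written with only the uint_least types. A minimal sketch of one direction, UTF-16 to UTF-32, with no error handling for lone or truncated surrogates (function name and signature are illustrative, not from the original post):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Decode UTF-16 code units into UTF-32 code points using only the
// integer typedefs from <cstdint> -- no char16_t/char32_t needed.
std::vector<std::uint_least32_t>
utf16_to_utf32(const std::vector<std::uint_least16_t>& in) {
    std::vector<std::uint_least32_t> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint_least32_t u = in[i];
        // Leading surrogate followed by a code unit: combine the pair.
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < in.size()) {
            std::uint_least32_t lo = in[++i]; // trailing surrogate
            u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        }
        out.push_back(u);
    }
    return out;
}
```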
I agree with you that "we *don't want* to specialize all possible facets for
each possible encoding, just as we don't want to add u16sprintf,
u32sprintf, u16cout, u32cout, etc...". Hopefully someone will step
forward with a set of deeply Unicode aware generic algorithms to take
advantage of Unicode specific functionality.
I personally prefer char32_t and char16_t to uint_least32_t and
uint_least16_t, but don't have enough experience with the C++11 types to
make blanket recommendations.
> 2. "Standard library strings with different character encodings have
> different types that do not interoperate." It's good. There shall be no
> implicit conversions in user code. If the user wants, she shall specify the
> conversion explicitly, as in:
> s2 = convert-with-whatever-explicit-interface-you-like("foo");
  int x;
  long y;
  y = x;
  x = y;
Nothing controversial here, and very convenient. The x = y conversion
is lossy, but the semantics are well defined and you can always use a
function call if you want different semantics.
  std::string x;
  std::wstring y;
  y = x;
  x = y;
Why is this any different? It is very convenient. We can argue about
the best semantics for the x = y conversion, but once those semantics
are settled you can always use a function call if you want different
semantics.
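One possible pair of well-defined semantics, sketched as a hedged example (the widen/narrow names are hypothetical, and the per-code-unit mapping shown is only correct for ASCII data; a real implementation would transcode properly):

```cpp
#include <cassert>
#include <string>

// Lossless direction: every narrow code unit fits in a wchar_t.
// Only meaningful for ASCII input with this naive mapping.
std::wstring widen(const std::string& s) {
    return std::wstring(s.begin(), s.end());
}

// Lossy direction: non-ASCII code units are replaced by '?' -- one
// possible well-defined semantic for the "x = y" case above.
std::string narrow(const std::wstring& s) {
    std::string out;
    for (wchar_t c : s)
        out.push_back(c >= 0 && c < 0x80 ? static_cast<char>(c) : '?');
    return out;
}
```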
> 3. "...class path solves some of the string interoperability
> problems..." Class path forces the user to use a specific encoding that she
> even may not be willing to hear of. It manifests in the following ways:
>    - The 'default' interface returns the encoding used by the system,
>    requiring the user to use a verbose interface to get the encoding
>    she uses.
>    - If the user needs to get the path encoded in her favorite encoding
>    *by reference* with a lifetime of the path (e.g. as a parameter to an
>    async call), she must maintain a long living *copy* of the temporary
>    returned from the said interface.
>    - Getting the extension from a narrow-string path using boost::path
>    on Windows involves *two* conversions although the system is never called
>    in the middle.
>    - Library code can't use path::imbue(). It must pass the
>    corresponding codecvt facet everywhere to use anything but the
>    (implementation defined and volatile at runtime) default.
My contention is that class path is having to take on conversion
responsibilities that are better performed by basic_string. That is part
of the motivation for exploring ways string classes could take on some
of those responsibilities.
> 4. "Can be called like this: (example)" So we had 2 encodings to
> consider before C++11, 4 after the additions in C++11, and you're proposing
> additions to make it easier to work with any number of encodings. We are
> moving towards encoding HELL.
The number of encodings isn't a function of C++, it is a function of
the real-world. Traditionally, there were many encodings in wide use,
and then Unicode came along with a few more. But the Unicode encodings
have enough advantages that users are gradually moving away from
non-Unicode encodings. C++ needs to accommodate that trend by becoming
friendlier to the Unicode encodings.
> 5. "A "Hello World" program using a C++11 Unicode string literal
> illustrates this frustration:" Unicode string literal (except u8)
> illustrates how adding yet another unneeded feature to the C++ standard
> complicates the language, adds problems, adds frustration and solves
> nothing. The user can just write
> cout << u8"您好世界";
> Even better is:
> cout << "您好世界";
> which *just works* on most compilers (e.g. GCC: http://ideone.com/lBpMJ)
> and needs some trickery on others (MSVC: save as UTF-8 without BOM). A much
> simpler solution is to standardize narrow string literals to be UTF-8
> encoded (or a better phrasing would be "capable of storing any Unicode
> data" so this will work with UTF-EBCDIC where needed), but I know it's too
> much to ask.
I'm not sure that is too much to ask for the C++ standard after C++11,
whatever it ends up being called. It would take a lot of careful
work to bring the various interests on board. A year ago was the wrong
point in the C++ standard revision cycle to even talk about such a
change. But C++11 has shipped. Now is the time to start the process of
moving the problem onto the committee's radar screen.
> 6. "String conversion iterators are not provided (minus Example)" This
> section *I fully support*. The additions to C++11 pushed by Dinkumware are
> heavy, not general enough, and badly designed. C++11 still lacks convenient
> conversion between different Unicode encodings, which is a must in today's
> world. Just a few notes:
>    - "Interfaces work at the level of entire strings rather than
>    characters," This *is* desired since the overhead of the temporary
>    allocations is repaid by the fact that optimized UTF-8↔UTF-16↔UTF-32
>    conversions need large chunks of data. Nevertheless I agree that iterator
>    access is sometimes preferred.
>    - Instead of the c_str() from "Example" a better approach is to
>    provide a convenience non-member function that can work on any range of
>    chars. E.g. using the "char type specifies the encoding" approach this
>    would be:
>    std::wstring wstr = convert<wchar_t>(u8"您好世界"); // doesn't even
>    construct an std::string
>    std::string u8str = convert<char>(wstr); // don't care for the name
While I'm totally convinced that conversion iterators would be very
useful, the exact form is an open question. Could you be more
specific about the details of your convert suggestion?
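One possible shape for that convert, sketched with the C++11 std::wstring_convert machinery (only the name and the "character type selects the encoding" convention come from the quoted suggestion; only the char/wchar_t pair is shown, the reverse direction is given a distinct name here, and codecvt_utf8 treats wchar_t as UCS-2 or UCS-4 depending on its width):

```cpp
#include <codecvt> // C++11; later deprecated in C++17, but illustrative
#include <locale>
#include <string>

// Primary template: the destination character type alone names the
// target encoding, per the suggestion quoted above.
template <class To>
std::basic_string<To> convert(const std::string& utf8);

template <>
std::wstring convert<wchar_t>(const std::string& utf8) {
    // UTF-8 -> wide: wchar_t is 16-bit on Windows, 32-bit elsewhere.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> cvt;
    return cvt.from_bytes(utf8);
}

// The reverse direction, here distinguished by name rather than by a
// convert<char> specialization taking a wide argument.
std::string convert_to_utf8(const std::wstring& wide) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> cvt;
    return cvt.to_bytes(wide);
}
```

Note that nothing here constructs an intermediate std::string beyond the result itself, matching the "doesn't even construct an std::string" remark for the literal case.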
> 7. True interoperability, portability and conciseness will come when
> we standardize on *one* encoding.
Even if we are only talking about Unicode, multiple encodings still
seem a necessity.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk