Subject: Re: [boost] [strings][unicode] Proposals for Improved String Interoperability in a Unicode World
From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2012-01-28 14:48:08


My opinion:

   1. You shall not use any char type other than char and wchar_t for
   working with strings. Using the char type and/or char_traits to mark the
   encoding doesn't work. This is because the standard provided facets, C
   standard library functions etc. are provided almost only for char and
   wchar_t types. And we *don't want* to specialize all possible facets for
   each possible encoding, just as we don't want to add u16sprintf,
   u32sprintf, u16cout, u32cout, etc... This would effectively increase the
   size of the interface to Θ(number-of-entities × number-of-encodings).
   Following the above you won't use char32_t and char16_t added in C++11
   either. You will use just one or two encodings internally that will be
   those used for char and wchar_t according to the conventions in your code
   and/or the platform you work with. The only place you may need the char**_t
   types is when converting from UTF-16/UTF-32 into the internal encoding you
   use for your strings (either narrow or wide). But in those conversion
   algorithms uint_least32_t and uint_least16_t suit your needs just fine.

   2. "Standard library strings with different character encodings have
   different types that do not interoperate." It's good. There shall be no
   implicit conversions in user code. If the user wants, she shall specify the
   conversion explicitly, as in:

   s2 = convert-with-whatever-explicit-interface-you-like("foo");

   3. "...class path solves some of the string interoperability
   problems..." Class path forces the user to use a specific encoding that she
   may not even want to hear of. It manifests in the following ways:
      - The 'default' interface returns the encoding used by the system,
      requiring the user to use a verbose interface to get the encoding she
      uses.
      - If the user needs to get the path encoded in her favorite encoding
      *by reference* with a lifetime of the path (e.g. as a parameter to an
      async call), she must maintain a long living *copy* of the temporary
      returned from the said interface.
      - Getting the extension from a narrow-string path using boost::path
      on Windows involves *two* conversions although the system is never called
      in the middle.
      - Library code can't use path::imbue(). It must pass the
      corresponding codecvt facet everywhere to use anything but the
      (implementation defined and volatile at runtime) default.

   4. "Can be called like this: (example)" So we had 2 encodings to
   consider before C++11, 4 after the additions in C++11 and you're proposing
   additions to make it easier to work with any number of encodings. We are
   moving towards encoding HELL.

   5. "A "Hello World" program using a C++11 Unicode string literal
   illustrates this frustration:" Unicode string literals (except u8)
   illustrate how adding yet another unneeded feature to the C++ standard
   complicates the language, adds problems, adds frustration and solves
   nothing. The user can just write

   cout << u8"您好世界";

   Even better is:

   cout << "您好世界";

   which *just works* on most compilers (e.g. GCC: http://ideone.com/lBpMJ)
   and needs some trickery on others (MSVC: save as UTF-8 without BOM). A much
   simpler solution is to standardize narrow string literals to be UTF-8
   encoded (or a better phrasing would be "capable of storing any Unicode
   data" so this will work with UTF-EBCDIC where needed), but I know it's too
   much to ask.

   6. "String conversion iterators are not provided (minus Example)" This
   section *I fully support*. The additions to C++11 pushed by Dinkumware are
   heavy, not general enough, and badly designed. C++11 still lacks convenient
   conversion between different Unicode encodings, which is a must in today's
   world. Just a few notes:
      - "Interfaces work at the level of entire strings rather than
      characters," This *is* desired since the overhead of the temporary
      allocations is repaid by the fact that optimized UTF-8↔UTF-16↔UTF-32
      conversions need large chunks of data. Nevertheless I agree that iterator
      access is sometimes preferred.
      - Instead of the c_str() from "Example" a better approach is to
      provide a convenience non-member function that can work on any range of
      chars. E.g. using the "char type specifies the encoding" approach this
      would be:

      // doesn't even construct an std::string
      std::wstring wstr = convert<wchar_t>(u8"您好世界");
      // don't care for the name
      std::string u8str = convert<char>(wstr);

   7. True interoperability, portability and conciseness will come when
   we standardize on *one* encoding.
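
As a rough sketch of points 1 and 6 (an explicit conversion through a
non-member function that accepts any range, with uint_least16_t and
uint_least32_t as the working types inside the algorithm), something like
the following could do. The names append_utf8 and utf16_to_utf8 are
hypothetical and not part of Boost or the standard library, and replacing
unpaired surrogates with U+FFFD is an assumption of this sketch, not
something the white paper specifies.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>

// Hypothetical helper: append one Unicode code point to a UTF-8 string.
inline void append_utf8(std::string& out, std::uint_least32_t cp) {
    assert(cp <= 0x10FFFF);  // callers must pass a valid code point
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical explicit conversion: any range of UTF-16 code units to a
// UTF-8 encoded std::string. No char16_t is required by the algorithm.
template <class Utf16Range>
std::string utf16_to_utf8(const Utf16Range& in) {
    std::string out;
    auto it = std::begin(in);
    const auto last = std::end(in);
    while (it != last) {
        std::uint_least32_t cp = static_cast<std::uint_least16_t>(*it);
        ++it;
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            std::uint_least32_t lo = 0;
            if (it != last) {
                lo = static_cast<std::uint_least16_t>(*it);
            }
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++it;
            } else {
                cp = 0xFFFD;  // assumption: replace an unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // assumption: replace a stray low surrogate
        }
        append_utf8(out, cp);
    }
    return out;
}
```

A call such as std::string s = utf16_to_utf8(some_utf16_range); keeps the
conversion visible at the call site, and the same function works on a
std::u16string, a vector or a plain array of code units.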

On Sat, Jan 28, 2012 at 18:46, Beman Dawes <bdawes_at_[hidden]> wrote:

> Beman.github.com/string-interoperability/interop_white_paper.html
> describes Boost components intended to ease string interoperability in
> general and Unicode string interoperability in particular.
>
> These proposals are the Boost version of the TR2 proposals made in
> N3336, Adapting Standard Library Strings and I/O to a Unicode World.
> See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3336.html.
>
> I'm very interested in hearing comments about either the Boost or the
> TR2 proposal. Are these useful additions? Is there a better way to
> achieve the same easy interoperability goals?
>
> Where is the best home for the Boost proposals? A separate library?
> Part of some existing library?
>
> Are these proposals orthogonal to the need for deeper Unicode
> functionality, such as Mathias Gaunard's Unicode components?
>
> --Beman

Sincerely,

-- 
Yakov
