Subject: Re: [boost] [strings][unicode] Proposals for Improved String Interoperability in a Unicode World
From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2012-01-30 12:00:53


On Sun, Jan 29, 2012 at 17:52, Beman Dawes <bdawes_at_[hidden]> wrote:
[...]

> I personally prefer char32_t and char16_t to uint_least32_t and
> uint_least16_t, but don't have enough experience with the C++11 types to
> make blanket recommendations.
>

I don't care about the name. I claim that we don't need a distinct type with
a keyword for that.

> >
> > 2. "Standard library strings with different character encodings have
> > different types that do not interoperate." It's good. There shall be no
> > implicit conversions in user code. If the user wants, she shall specify
> > the conversion explicitly, as in:
> >
> > s2 = convert-with-whatever-explicit-interface-you-like("foo");
>
> int x;
> long y;
> ...
> y = x;
> ...
> x = y;
>
> Nothing controversial here, and very convenient. The x = y conversion
> is lossy, but the semantics are well defined and you can always use a
> function call if you want different semantics.
>

It is controversial. It was inherited from C, where even an implicit
void* -> int* conversion is allowed. Some argue that x = y should be an
error. See D&E 14.3.5.2. Most compilers issue a warning for this. Note that
where compatibility with C is not a concern, as in C++11 list-initialization,
C++ prohibits narrowing conversions:

vector<int> v = {1, 2, 3};
vector<short> v1 = {v[0], v[1], v[2]};
vector<long> v2 = v; // not narrowing but fails too

Btw, x = y is implementation-defined if the value of y does not fit in an
int (e.g. a large negative), not "well defined".
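To make the two points above concrete, here is a minimal sketch. The helper names (widen, fits_in_int, narrow_checked) are made up for illustration, and long long is used instead of long so that the range check is meaningful on platforms where long and int have the same size. It shows that the non-narrowing container and string conversions simply do not exist implicitly, and that a checked narrowing is easy to spell as an explicit function call.

```cpp
#include <cassert>
#include <limits>
#include <string>
#include <type_traits>
#include <vector>

// Containers and strings with different element types have no implicit
// conversions between them, even when every element would convert losslessly.
static_assert(!std::is_convertible<std::vector<int>, std::vector<long>>::value,
              "vector<int> does not implicitly convert to vector<long>");
static_assert(!std::is_convertible<std::string, std::u32string>::value,
              "string does not implicitly convert to u32string");

// Explicit element-wise widening through the iterator-range constructor.
inline std::vector<long> widen(const std::vector<int>& v) {
    return std::vector<long>(v.begin(), v.end());
}

// Hypothetical helper: true if y can be stored in an int without changing value.
inline bool fits_in_int(long long y) {
    return y >= std::numeric_limits<int>::min() &&
           y <= std::numeric_limits<int>::max();
}

// Hypothetical helper: explicit narrowing with a checked precondition,
// i.e. "a function call if you want different semantics".
inline int narrow_checked(long long y) {
    assert(fits_in_int(y));
    return static_cast<int>(y);
}
```

The user who really wants vector<long> from vector<int> writes widen(v), and the conversion is visible at the call site.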

> string x;
> u32string y;
> ...
> y = x;
> ...
> x = y;
>
> Why is this any different? It is very convenient. We can argue about
> the best semantics for the x = y conversion, but once those semantics
> are settled you can always use a function call if you want different
> semantics.
>

Convenient: yes. But not every convenient feature is good. It can do harm.
The first things that come to mind are:

   1. Overload resolution ambiguity or surprising results.
   2. It hides potentially expensive conversions (I agree to do these
   implicitly only when interacting with 3rd-party code).
   3. It eases interoperability between different encodings, thus postponing
   standardization on one encoding, yet it doesn't solve the headache
   completely (the user still has to think about encodings and choose the
   string she needs from this zoo: string, u16string, u32string...).

And why don't we have std::string::operator const char*()?
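A small sketch of point 2, the hidden cost. Everything here is hypothetical: implicit_u32text stands in for a string type with the proposed implicit converting constructor, and its "conversion" just widens bytes, so it is only correct for ASCII. The counter is there purely to make the invisible work observable.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts how many narrow -> UTF-32 conversions have run (for illustration only).
static int g_conversions = 0;

// Hypothetical string type with the implicit converting constructor under discussion.
struct implicit_u32text {
    std::u32string data;

    implicit_u32text(const std::string& s) {  // deliberately not explicit
        ++g_conversions;
        // Placeholder conversion: widens each byte, so it is only right for ASCII.
        for (unsigned char c : s) data.push_back(static_cast<char32_t>(c));
    }
};

// An API that takes the UTF-32 type.
inline std::size_t code_point_count(const implicit_u32text& t) {
    return t.data.size();
}

// Looks harmless, but every iteration silently converts and allocates a
// fresh UTF-32 copy of s.
inline std::size_t total_code_points(const std::string& s, int repeats) {
    assert(repeats >= 0);
    std::size_t total = 0;
    for (int i = 0; i < repeats; ++i) total += code_point_count(s);
    return total;
}
```

Nothing at the call code_point_count(s) hints that a conversion and an allocation happen. With an explicit constructor the same loop would not compile until the user wrote the conversion out, most likely hoisting it out of the loop.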

> > 3. "...class path solves some of the string interoperability
> > problems..." Class path forces the user to use a specific encoding that
> > she even may not be willing to hear of. It manifests in the following
> > ways:
> > - The 'default' interface returns the encoding used by the system,
> > requiring the user to use a verbose interface to get the
> > encoding she uses.
> > - If the user needs to get the path encoded in her favorite encoding
> > *by reference* with a lifetime of the path (e.g. as a parameter to an
> > async call), she must maintain a long living *copy* of the temporary
> > returned from the said interface.
> > - Getting the extension from a narrow-string path using boost::path
> > on Windows involves *two* conversions although the system is never
> > called in the middle.
> > - Library code can't use path::imbue(). It must pass the
> > corresponding codecvt facet everywhere to use anything but the
> > (implementation defined and volatile at runtime) default.
>
> My contention is that class path is having to take on conversion
> responsibilities that are better performed by basic_string. That is part
> of the motivation for exploring ways string classes could take on some
> of those responsibilities.
>

Good. But my intent is to move the conversions inside the operational
functions (preferable). Until we can standardize on a Unicode execution
character set, let the conversion happen when calling those functions
(perhaps use a path_ref that does it implicitly if we don't want the FS v2
templated functions). Let me remind you that class path is used not just for
calling the system.

> >
> > 4. "Can be called like this: (example)" So we had 2 encodings to
> > consider before C++11, 4 after the additions in C++11 and you're
> > proposing additions to make it easier to work with any number of
> > encodings. We are moving towards encoding HELL.
>
> The number of encodings isn't a function of C++, it is a function of
> the real-world. Traditionally, there were many encodings in wide use,
> and then Unicode came along with a few more. But the Unicode encodings
> have enough advantages that users are gradually moving away from
> non-Unicode encodings. C++ needs to accommodate that trend by becoming
> friendlier to the Unicode encodings.
>

Sure. But it doesn't mean that it has to be friendlier to ALL Unicode
encodings.

> >
> > 5. "A "Hello World" program using a C++11 Unicode string literal
> > illustrates this frustration:" Unicode string literal (except u8)
> > illustrates how adding yet another unneeded feature to the C++ standard
> > complicates the language, adds problems, adds frustration and solves
> > nothing. The user can just write
> >
> > cout << u8"您好世界";
> >
> > Even better is:
> >
> > cout << "您好世界";
> >
> > which *just works* on most compilers (e.g. GCC: http://ideone.com/lBpMJ)
> > and needs some trickery on others (MSVC: save as UTF-8 without BOM). A
> > much simpler solution is to standardize narrow string literals to be
> > UTF-8 encoded (or a better phrasing would be "capable of storing any
> > Unicode data" so this will work with UTF-EBCDIC where needed), but I
> > know it's too much to ask.
>
> I'm not sure that is too much to ask for the C++ standard after C++11,
> whatever it ends up being called. It would take a lot of careful
> work to bring the various interests on board. A year ago was the wrong
> point in the C++ standard revision cycle to even talk about such a
> change. But C++11 has shipped. Now is the time to start the process of
> moving the problem onto the committee's radar screen.
>

Thanks for the forecast!

> >
> > 6. "String conversion iterators are not provided (minus Example)" This
> > section *I fully support*. The additions to C++11 pushed by Dinkumware
> > are heavy, not general enough, and badly designed. C++11 still lacks
> > convenient conversion between different Unicode encodings, which is a
> > must in today's world. Just a few notes:
> > - "Interfaces work at the level of entire strings rather than
> > characters," This *is* desired since the overhead of the temporary
> > allocations is repaid by the fact that optimized UTF-8↔UTF-16↔UTF-32
> > conversions need large chunks of data. Nevertheless I agree that
> > iterator access is sometimes preferred.
> > - Instead of the c_str() from "Example" a better approach is to
> > provide a convenience non-member function that can work on any range
> > of chars. E.g. using the "char type specifies the encoding" approach
> > this would be:
> >
> > std::wstring wstr = convert<wchar_t>(u8"您好世界"); // doesn't even construct an std::string
> > std::string u8str = convert<char>(wstr); // don't care for the name
>
> While I'm totally convinced that conversion iterators would be very
> useful, the exact form is an open question. Could you be more
> specific about the details of your convert suggestion?
>

The point is that it's more like a free-standing version of the c_str() you
proposed. Unlike the c_str() member function, it would work on any character
range and return a range of converting iterators. We don't need to extend
basic_string, which is already too big, for this.
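To be a bit more specific, a rough sketch of what I have in mind, for the UTF-8 to UTF-32 direction only. The names (utf8_decoding_iterator, utf8_decoded_range, decode_utf8) are placeholders, not a proposed interface, and the decoder assumes well-formed UTF-8; a real one must validate its input and handle the other directions too.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical input iterator that decodes UTF-8 code units into code points.
// Assumes well-formed UTF-8; no validation is done.
template <class It>
class utf8_decoding_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    utf8_decoding_iterator(It pos, It end) : pos_(pos), end_(end) {}

    char32_t operator*() const {
        assert(pos_ != end_);
        It p = pos_;
        const unsigned char lead = static_cast<unsigned char>(*p);
        const int extra = trailing(lead);
        // Keep only the payload bits of the lead byte.
        char32_t cp = static_cast<char32_t>(
            extra == 0 ? lead : (lead & (0x3F >> extra)));
        for (int i = 0; i < extra; ++i) {
            if (++p == end_) break;  // truncated sequence: stop early
            cp = (cp << 6) | (static_cast<unsigned char>(*p) & 0x3F);
        }
        return cp;
    }

    utf8_decoding_iterator& operator++() {
        assert(pos_ != end_);
        int n = 1 + trailing(static_cast<unsigned char>(*pos_));
        while (n-- > 0 && pos_ != end_) ++pos_;  // skip the whole sequence
        return *this;
    }
    utf8_decoding_iterator operator++(int) {
        utf8_decoding_iterator old = *this;
        ++*this;
        return old;
    }

    friend bool operator==(const utf8_decoding_iterator& a,
                           const utf8_decoding_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_decoding_iterator& a,
                           const utf8_decoding_iterator& b) {
        return !(a == b);
    }

private:
    // Number of continuation bytes announced by a lead byte.
    static int trailing(unsigned char lead) {
        if (lead >= 0xF0) return 3;
        if (lead >= 0xE0) return 2;
        if (lead >= 0xC0) return 1;
        return 0;
    }

    It pos_;
    It end_;
};

// A pair of converting iterators, usable with range-for and range constructors.
template <class It>
struct utf8_decoded_range {
    utf8_decoding_iterator<It> first, last;
    utf8_decoding_iterator<It> begin() const { return first; }
    utf8_decoding_iterator<It> end() const { return last; }
};

// Free function: works on any range of narrow characters, not just std::string.
template <class It>
utf8_decoded_range<It> decode_utf8(It first, It last) {
    return {utf8_decoding_iterator<It>(first, last),
            utf8_decoding_iterator<It>(last, last)};
}
```

Given auto r = decode_utf8(s.begin(), s.end()), the caller can write std::u32string out(r.begin(), r.end()) or iterate lazily without any temporary string. The same function accepts a pair of const char* just as well, and nothing had to be added to basic_string.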

> > 7. True interoperability, portability and conciseness will come when
> > we standardize on *one* encoding.
>
> Even if we are only talking about Unicode, multiple encodings still
> seem a necessity.
>

Unicode algorithms work on code points (UCS-4) internally. Everything else
can be encoded in some (narrow) execution character set capable of storing
Unicode. Almost no one implements Unicode algorithms themselves, so we can
practically assume that one encoding is sufficient on each platform.

-- 
Yakov
