Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Matus Chochlik (chochlik_at_[hidden])
Date: 2011-01-24 05:44:15


On Sun, Jan 23, 2011 at 6:42 PM, Patrick Horgan <phorgan1_at_[hidden]> wrote:
[snip/]
>
> No.  They're using std::string.  It works just fine for this as it does for
> other things.  Its performance guarantees are with respect to their templated
> data type, not in terms of the encoding of the contents.  std::string lets
> you walk through a JIS string and decode it.  A utf-8 string would hurl
> chunks since, of course, it wouldn't be encoded as utf-8.  I could go on and
> on, but perhaps if you'd refresh yourself on the interface of std::string
> and think about the implications on that if you had a validating utf-8
> string you'd see.  I'm really in favor of an utf-8 string, I just wouldn't
> call it string because that would be a lie.  It wouldn't be a general
> string, but a special case of string.

This whole debate is, at least for me, about what std::string
is and what we want it to be:

a) a little more than a glorified std::vector<char> with a few extra
operations for more convenient handling of the byte sequences
stored inside, which can currently be interpreted in dozens if not
hundreds of ways, depending on the current platform, the default
or explicitly selected locale+encoding, etc., etc.

b) a container of byte sequences that represent human-readable text,
where every single sequence (provided it is valid) can be translated
by a standardized mapping into exactly one sequence of "logical"
characters of that text, and which also provides operations for handling
the text character by character, not only byte by byte (portably).
If the application so wishes, it can still treat the string as just a byte
sequence, because this is of course a valid usage.
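
To make the difference concrete, here is a minimal sketch (purely my own
illustration, not a proposal; it assumes well-formed UTF-8 and does no
validation) of the two views living on top of the very same byte storage:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>

// Decode one UTF-8 code point starting at byte position i and advance i
// past its bytes.  Assumes well-formed UTF-8 (no validation whatsoever).
std::uint32_t next_code_point(const std::string& s, std::size_t& i)
{
    unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;                        // single-byte (ASCII)
    int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
    std::uint32_t cp = b0 & (0x3F >> extra);         // mask off the length bits
    while (extra-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

int main()
{
    // g, r, u-umlaut, sharp-s: 4 logical characters encoded in 6 bytes.
    std::string text = "gr\xC3\xBC\xC3\x9F";

    // View (a): just a byte sequence.
    std::cout << "bytes: " << text.size() << '\n';       // prints 6

    // View (b): the same bytes read code point by code point.
    std::size_t i = 0, chars = 0;
    while (i < text.size()) { next_code_point(text, i); ++chars; }
    std::cout << "code points: " << chars << '\n';       // prints 4
}

A real implementation would of course have to validate and deal with
ill-formed sequences; the point is only that both views can be offered
over the same underlying bytes.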

>
>> ... elision by me.
>> What functionality would they lose exactly? Again, on many platforms
>> the default encoding for all (or nearly all) locales already is UTF-8, so
>> if you get a string from the OS API and store it into a std::string then
>> it is UTF-8 encoded. I do an equal share of programming on Windows
>> and Linux platforms and I have yet to run into these problems you
>> describe on Linux, where for some time now the default encoding has been UTF-8.
>> Actually, today I encounter more problems on Windows, where I can't
>> set the locale to use UTF-8 and consequently I have to transcode data
>> from socket connections or files manually.
>
> I didn't say ever, that having utf-8 encoded characters in a std::string
> would cause you some kind of problems.  I don't think I even said anything
> that would let you infer that.  You're completely off base here.  I was
> talking about a string specialized for utf-8 encoded characters.  You're
> chasing a red-herring.  You're stalking a strawman target.  I agree with you
> entirely.  std::string does a great job of holding utf-8 encoded characters,
> as well as many other things.

OK,
[OT]
I was referring to the
[quote]
> Why would people want to lose so much of the functionality of std::string?
[/quote]
part. I meant no offense; I merely said that I have yet to run into any
problems (losing much of the functionality) on platforms where std::string
is used to hold byte sequences in the UTF-8 encoding.
I certainly don't need to be right or to have everyone agree with me; this
is a discussion and I gladly let myself be educated by people
who know more about the issue at hand than I do.
[/OT]
>
>> If you are talking about being able to have indexed random access
>> to "logical characters", for example on Windows with some special
>> encodings, then this is only platform-specific and unportable
>> functionality.
>> What I propose is to extend the interface so that it would allow you
>> to handle the "raw byte sequences" that are now used to represent strings
>> of logical characters in a platform-independent way by using the Unicode
>> standard.
>
> That's nice.  I vote for that idea.  Just don't call it std::string, because
> it won't be, and you won't be able to do everything with it that std::string
> does today.

Would you care to elaborate on what functionality you would lose?
Even random access to individual characters could be implemented.
Of course, this would break the existing performance guarantees,
which, however, only hold on platforms that use std::string
for single-byte encodings. It could also employ some caching mechanism
to speed things up, but this is just an implementation detail and
of course has its trade-offs.
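
Just to show where the cost would go, here is a minimal sketch (again purely
my own illustration, assuming well-formed UTF-8) of such indexed access; it is
O(n) per lookup, which is exactly the broken guarantee mentioned above, and a
caching layer could remember the computed byte offsets to speed up repeated
lookups:

#include <cstddef>
#include <stdexcept>
#include <string>

// Return the byte offset at which the n-th code point of a well-formed
// UTF-8 string starts; linear in n (so an "operator[]" on characters would
// be O(n) instead of the current O(1) on bytes).
std::size_t code_point_offset(const std::string& s, std::size_t n)
{
    std::size_t i = 0;
    while (n > 0 && i < s.size())
    {
        unsigned char b = static_cast<unsigned char>(s[i]);
        // Advance by 1 byte for ASCII, otherwise by the sequence length
        // encoded in the lead byte.
        i += (b < 0x80) ? 1 : (b >= 0xF0) ? 4 : (b >= 0xE0) ? 3 : 2;
        --n;
    }
    if (i >= s.size()) throw std::out_of_range("code point index");
    return i;
}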

But if this helps us to (slowly) get rid of the need to handle the various
encodings that are relics of an age when every single byte
of memory and every processor tick was a precious resource,
then I am all for it. I imagine that the folks at the Unicode consortium
have worked on the standard for the past 20+ years not only
to create yet another encoding that would complement and live
happily ever after with all the others, but to eventually replace them.

Having said that, I *do not* want to "ban" or prevent anyone from
using specific encodings where it is necessary or advantageous,
but such usage should be considered a special case and not
general usage, as it is now. Many database systems, web browsers,
web-content-creation tools, XML editors, etc., etc. already consider
UTF-8 to be the default and, yes, they let you work with other encodings,
but as a special case.

[snip/]
> That's an advantage of a utf-8 encoded file. You don't need a special string
> type to write to that. Before writing to the file you can hold the data in
> memory in a std::string or a C string, or a chunk of mmap'd memory today,
> and if they contain data encoded in utf-8 you have the same advantage.

That is not only what I can do, but also what I already do, and I'm not very
happy with the results, because if std::string is now expected by the
OS/libraries to use a platform-specific encoding, then the two do not play
together very well, unless you (of course) transcode explicitly.
I rarely use a mem-mapped file as a whole without trying to parse it
and use the data, for example, in a GUI.
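
For the record, that explicit transcoding step looks roughly like this: a
minimal sketch assuming a POSIX-like environment with iconv available and
assuming (just for the illustration) that the legacy data is Latin-1:

#include <iconv.h>
#include <cstddef>
#include <stdexcept>
#include <string>

// Convert a Latin-1 encoded byte string into UTF-8 before handing it to
// code that expects UTF-8 in a std::string.
std::string latin1_to_utf8(const std::string& in)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

    std::string out(in.size() * 4 + 4, '\0');      // generous output buffer
    char* src = const_cast<char*>(in.data());
    char* dst = &out[0];
    std::size_t src_left = in.size(), dst_left = out.size();

    std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1) throw std::runtime_error("transcoding failed");

    out.resize(out.size() - dst_left);             // trim to what was produced
    return out;
}

On Windows one has to fall back to MultiByteToWideChar/WideCharToMultiByte or
a bundled libiconv, which is precisely the kind of per-platform busywork I
would like to see disappear.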

[snip/]
>
> A string specialized to only hold utf-8 encoded data wouldn't be any good to
> someone not using utf-8 encoding.  Even if they were using 7-bit ascii for a
> network application, like ftp, for example, they'd have to pay the penalty
> for validating that the 7-bit ascii was valid utf-8.  If they're using it as
> a container to carry around encrypted strings, well that wouldn't be
> possible at all.

Let us have a "special_encoding_string" where we need to handle the legacy
encodings ...

>
> If a system call returned a name valid in that operating system that I would
> later pass to another system call, if it wasn't utf-8 what could I do?
>  Break?  Or corrupt the string?

... and a native_encoding_string, or let's even use vector<char>, for these
two (valid, but IMO *special*) use cases.
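
Purely as an illustration (the names are hypothetical, just picking up the
ones used above), such types could be little more than tagged wrappers, so
that legacy- or native-encoded bytes cannot silently end up where UTF-8 text
is expected:

#include <string>
#include <vector>

// Bytes in the platform's native (locale-dependent) encoding.
struct native_encoding_string { std::string bytes; };

// Bytes in an explicitly chosen legacy encoding, selected by a tag type.
template <class EncodingTag>
struct special_encoding_string { std::vector<char> bytes; };

struct shift_jis_tag {};   // example tag for one legacy encoding

// Plain std::string would be reserved for UTF-8 text in this scheme.
void show_in_gui(const std::string& utf8_text);

int main()
{
    special_encoding_string<shift_jis_tag> legacy;  // must be transcoded first
    native_encoding_string from_os;                 // must be transcoded first
    // show_in_gui(legacy.bytes);  // would not compile: vector<char> is not std::string
    (void)legacy; (void)from_os;
}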

[snip/]

> Vayo con Diós,
Hasta la vista :)

Matus
