Subject: Re: [boost] [string] proposal
From: Patrick Horgan (phorgan1_at_[hidden])
Date: 2011-01-27 16:57:34


On 01/27/2011 04:45 AM, Matus Chochlik wrote:
> ... elision by patrick ...
> In general? Nothing. I do not have (nor did I have in the past)
> anything against a general efficient encoding-agnostic string
> if it is called general_string. But std::string IMO is and always
> has been primarily about handling text. I certainly do not know
> anyone who would store an MPEG inside std::string.

You may think it strange, but there's a lot of code out there that uses
std::string as a binary buffer.
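
To make that concrete, here is a minimal sketch of my own (not taken
from any particular codebase). It relies only on the (pointer, length)
constructor of std::string, which copies exactly the number of bytes it
is told to and does not stop at an embedded '\0'. The names
make_binary_buffer and binary_buffer_demo are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: copy raw bytes into a std::string.
// The (pointer, length) constructor copies exactly `len` bytes and
// does not treat '\0' as a terminator.
std::string make_binary_buffer(const unsigned char* data, std::size_t len)
{
    return std::string(reinterpret_cast<const char*>(data), len);
}

// Builds a buffer from four bytes that are not text in any useful sense
// and returns its size.
std::size_t binary_buffer_demo()
{
    const unsigned char raw[] = { 0x00, 0xFF, 0x00, 0x47 };
    const std::string buf = make_binary_buffer(raw, sizeof raw);
    assert(buf.size() == 4);                    // every byte is kept
    assert(buf[1] == static_cast<char>(0xFF));  // stored unchanged
    return buf.size();
}
```

Nothing here cares what the bytes mean, which is why the type ends up
being used for things that are not text.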

>> You have a sequence of bytes (in a string).
>> You want to interpret that sequence of bytes in a given encoding (with a view).
>>
>> Why does the encoding have to apply only to text?

It doesn't, and in your immutable string (or with std::string also) your
idea of views is a nice one. It would have different benefits than a
utf-xx_string with intrinsic encoding.
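
As I understand the view idea, it could look roughly like the sketch
below. This is only my guess at the shape, not Dean's actual design:
utf8_view and its members are hypothetical names, and the code point
count assumes the bytes really are well-formed UTF-8, since a view of
this kind checks nothing up front.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical view: refers to bytes owned elsewhere and interprets them
// as UTF-8. It never copies or modifies the underlying string, so the
// string must outlive the view.
class utf8_view
{
public:
    explicit utf8_view(const std::string& bytes) : bytes_(bytes) {}

    // Number of raw bytes in the underlying buffer.
    std::size_t byte_count() const { return bytes_.size(); }

    // Number of code points, ASSUMING well-formed UTF-8: count every
    // byte that is not a continuation byte (bit pattern 10xxxxxx).
    std::size_t code_point_count() const
    {
        std::size_t n = 0;
        for (unsigned char c : bytes_)
            if ((c & 0xC0) != 0x80)
                ++n;
        return n;
    }

private:
    const std::string& bytes_;
};

// The same six bytes seen two ways: as bytes and as UTF-8 code points.
std::size_t view_demo()
{
    const std::string bytes = "na" "\xC3\xAF" "ve";  // "naive" with U+00EF
    const utf8_view view(bytes);
    assert(view.byte_count() == 6);
    return view.code_point_count();
}
```

The buffer itself stays encoding-agnostic; a different view type could
read the same bytes as something else entirely.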

> Encoding does not have to apply only to text, but my,
> let's call it a vision, is, that the "everyday" handling
> of text would use a single encoding. There are people
> who have invested a whole lotta love :) and time into
> making it possible and they are generally called
> Unicode consortium. C++(1x) already adopts part of
> their work via the u"" and U"" literal types, because
> it has countless advantages. Why not take one more
> step in that direction and use it for the 'string' type by
> default?

That won't happen with std::string though. It's in the C++ spec as
behaving a certain way and you won't change that. You might have a
chance of getting a utf-8_string in there though.

>>> [snip/]
>>>> But this already happens, it's called 7-bit clean byte encoding --
>>>> barring any endianness issues, just stuff whatever you already have in
>>>> a `char const *` into a socket. HTTP, FTP, and even memcached's
>>>> protocol work fine without the need to interpret strings other than a
>>>> sequence of bytes; my original opposition is having a string that by
>>>> default looked at data in it as UTF-8 when really a string would just
>>>> be a sequence of bytes not necessarily contiguous.
>>> Again, where you see a string primarily as a class for handling
>>> raw data, that can be interpreted in hundreds of different ways
>>> I see string primarily as a class for encoding human readable text.

And you see it as encoding it in utf-8. Don't forget that. It's a very
specialized use out of the many that std::string supports today.

>> So what's the difference between a string for encoding human readable
>> text and a string that handles raw data?
> Usability. Super-generic, solve-everything tools are usually harder to
> use. For probably the 10th time, I repeat that I'm not against such a
> string in general, but it is not std::string.

And neither would a string that enforced utf-8 encoding be std::string.
We already have one in the spec, and it's not that.

> ... elision by patrick ...
> Unnecessary verbosity.
>
> Do you really want all the people that now do:
>
> struct person
> {
>     std::string name;
>     std::string middle_name;
>     std::string family_name;
>     // .. etc.
> };
>
> to do this?
>
> struct person
> {
>     boost::view<some_encoding_tag> name;
>     boost::view<some_encoding_tag> middle_name;
>     boost::view<some_encoding_tag> family_name;
>     // .. etc.
> };

If their encoding is not utf-8 compatible, it works with std::string but
wouldn't work with your utf-8 string. Your argument applies just as well
to your own string.

> ... elision by patrick ...
>> Right, what I meant to say is that it hardly has any bearing when
>> we're talking about engineering solutions. So your circumstances and
>> mine may very well be different, but that doesn't change that we're
>> trying to solve the same problem. :)

No. You're not trying to solve the same problem at all! (And neither
of you is trying to deal with std::string.)

You, Dean, are trying to solve an efficiency problem caused by mutable
strings, and you note that an external view can interpret the data in
any encoding desired. You correctly point out that this is more general
and flexible, that it has a power that can be applied to many things
while giving you all the efficiency advantages of immutable data types.
(Although why a general buffer for immutable data would be called
string, which is normally associated with text, _is_ a bit confusing. I
suspect you've gone down a road you never intended trying to make this
point.)

You, Matus, are trying to solve a problem caused by a plethora of
possible encodings and the extra work that has to be done every time you
have to deal with them. You do it by specifying that a string will have
an encoding type associated with it (with utf-8 as the natural default),
and that the specialized string itself will enforce the encoding as well
as provide ways to convert other encodings to it. (And I think the
natural way to do this is with code conversion facets.) You correctly
point out that this specificity allows a power in solving this one
particular problem that a more general solution wouldn't be able to
match. A general string with a view into it would allow you to get
invalidly encoded data into it (N.B. for an immutable string _into it_
would have a different meaning) and you would only know about this after
the fact.
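
A rough sketch of what "the string itself enforces the encoding" could
mean, as opposed to finding out afterwards through a view. This is my
own illustration, not Matus's proposal: utf8_string, is_valid_utf8 and
rejects_invalid_demo are hypothetical names, and the check follows the
usual UTF-8 well-formedness rules (no truncated sequences, no overlong
forms, no surrogates, nothing above U+10FFFF).

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;   // continuation bytes expected after `c`
        unsigned long cp;    // code point decoded so far
        unsigned long min;   // smallest code point allowed at this length
        if (c < 0x80)                { extra = 0; cp = c;        min = 0; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return false;   // stray continuation byte or invalid lead byte

        if (n - i - 1 < extra) return false;            // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;      // not a continuation
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min) return false;                      // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // out of range
        i += extra + 1;
    }
    return true;
}

// Hypothetical enforcing string: invalid data never gets in, because the
// constructor throws instead of storing it.
class utf8_string
{
public:
    explicit utf8_string(const std::string& bytes) : bytes_(bytes)
    {
        if (!is_valid_utf8(bytes_))
            throw std::invalid_argument("not valid UTF-8");
    }
    const std::string& bytes() const { return bytes_; }

private:
    std::string bytes_;
};

// Returns true if construction from a truncated sequence was rejected.
bool rejects_invalid_demo()
{
    assert(is_valid_utf8("plain ascii"));
    const std::string bad("\xC3");  // lead byte with no continuation byte
    try {
        utf8_string s(bad);
    } catch (const std::invalid_argument&) {
        return true;
    }
    return false;
}
```

With the view approach the same bad bytes would sit in the buffer
untouched, and is_valid_utf8 (or its equivalent) would only run if and
when somebody asked.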

These are both great things. Kudos to you both. You're both right.
You guys keep arguing apples and orangutans and it makes it hard for
others to talk about either one of your ideas because you're so busy
going back and forth telling each other that the other doesn't get what
they're trying to say.

I wish you'd split into threads like [immutable string] and [unicode
string].

Patrick

