Subject: Re: [boost] [string] proposal
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2011-01-27 08:55:20


On Thu, Jan 27, 2011 at 8:45 PM, Matus Chochlik <chochlik_at_[hidden]> wrote:
> On Thu, Jan 27, 2011 at 12:09 PM, Dean Michael Berris
> <mikhailberis_at_[hidden]> wrote:
>>
>> But why do you need to separate text encoding from encoding in
>> general? Here's the logic:
>
> In general? Nothing. I do not have (nor did I have in the past)
> anything against a general efficient encoding-agnostic string
> if it is called general_string. But std::string IMO is and always
> has been primarily about handling text. I certainly do not know
> anyone who would store a MPEG inside std::string.
>

std::string has not been about handling text -- it's about
encapsulating the notion of a sequence of characters with a suitable
definition of `character`.

String algorithms -- pattern matching, slicing/concatenation, character
location, tokenization, etc. -- apply to any such sequence. The notion
of "text" is actually a higher-level concept which imbues a string with
things like encoding, language, locale, etc., all of which live at a
different level.

As for people storing <encoded> data inside a string, note that most
text-based protocols now transfer binary payloads in Base64, Base32, or
some variant of those encodings -- precisely so that they can be dealt
with as character sequences. If you were receiving a Base64-encoded
H.264 video stream over XMPP, why not put it in a string? I wouldn't
put it in std::string if I had any *sane* choice, because IMO
std::string is just broken, but like most people who intend to work
with in-memory data read from a character stream, you put it in a
string.
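
To make the Base64 point concrete, here is a minimal sketch of treating
such a payload as a plain character sequence and only turning it into
raw bytes at the edge. The function name base64_decode is made up for
illustration (it is not an existing Boost or standard API), and the
sketch assumes the standard Base64 alphabet with no strict validation.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decodes standard-alphabet Base64 held in a string.
// It stops at the first '=' or any other non-alphabet character and does
// no strict validation, so it is a sketch rather than a robust decoder.
inline std::vector<std::uint8_t> base64_decode(const std::string& text) {
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::vector<std::uint8_t> out;
    std::uint32_t buffer = 0;  // accumulates 6-bit groups
    int bits = 0;              // number of not-yet-emitted bits in buffer
    for (char c : text) {
        std::string::size_type pos = alphabet.find(c);
        if (pos == std::string::npos) break;  // padding or foreign character
        buffer = (buffer << 6) | static_cast<std::uint32_t>(pos);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<std::uint8_t>((buffer >> bits) & 0xFF));
        }
    }
    return out;
}
```

The string here is only a carrier of characters; what those characters
mean is decided by whatever interprets them.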

>>
>> You have a sequence of bytes (in a string).
>> You want to interpret that sequence of bytes in a given encoding (with a view).
>>
>> Why does the encoding have to apply only to text?
>
> Encoding does not have to apply only to text, but my,
> let's call it a vision, is that the "everyday" handling
> of text would use a single encoding. There are people
> who have invested a whole lotta love :) and time into
> making it possible and they are generally called the
> Unicode Consortium. C++(1x) already adopts part of
> their work via the u"" and U"" literal types, because
> it has countless advantages. Why not take one more
> step in that direction and use it for the 'string' type by
> default?
>

So the literals are already encoded and, guess what, they're still a
sequence of bytes. The only "sane" way to deal with that is to provide
an appropriate *view* of the encoded data at the appropriate level of
abstraction. A string, I argue, is *not* that level of abstraction.
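
A minimal sketch of the kind of view I mean, assuming well-formed UTF-8
input. The class name utf8_view and its member code_points() are
hypothetical, not an existing Boost interface. The bytes stay in the
string untouched; only the view knows how to interpret them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical view type: it neither owns nor modifies the bytes, it only
// interprets them as UTF-8. The referenced string must outlive the view.
class utf8_view {
public:
    explicit utf8_view(const std::string& bytes) : bytes_(&bytes) {}

    // Decodes the bytes into code points. Assumes well-formed UTF-8;
    // there is no validation beyond a truncation check in debug builds.
    std::vector<std::uint32_t> code_points() const {
        std::vector<std::uint32_t> out;
        const std::string& s = *bytes_;
        std::size_t i = 0;
        while (i < s.size()) {
            unsigned char lead = static_cast<unsigned char>(s[i]);
            std::size_t len = 1;
            std::uint32_t cp = lead;
            if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
            else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
            else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }
            assert(i + len <= s.size() && "truncated UTF-8 sequence");
            for (std::size_t k = 1; k < len; ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
            out.push_back(cp);
            i += len;
        }
        return out;
    }

private:
    const std::string* bytes_;
};
```

The same three bytes "h\xC3\xA9" are a 3-element byte sequence to the
string and a 2-element code point sequence to the view.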

>>
>> So what's the difference between a string for encoding human readable
>> text and a string that handles raw data?
>
> Usability. It is usually more difficult to use the super-generic everything-
> solving things. I again, for probably the 10th time, repeat that I'm not against
> such a string in general but this is not std::string.
>

Usability of what, the type? Any type is as usable as any other the
way I see it -- they're all just types. So aside from
aesthetic/cosmetic differences, what's the point?

>>
>> So what's wrong with:
>>
>> view<some_encoding_0> x = get_x();
>> view<some_encoding_1> y = get_y();
>> view<some_encoding_3> z = x+y;
>> float w = log(as<acme_float_encoding>(z));
>
> Unnecessary verbosity.
>

What verbosity?

We deal with that through typedefs and descriptive names. Heck, C++0x
has auto, so I don't know what 'verbosity' you're referring to.

And if you really wanted to know the encoding of the data from the
type, how else would you do it?
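
Roughly what I have in mind, as a sketch only: the tag types, the view
template, and full_name below are all hypothetical names, and for
brevity this toy view owns its bytes instead of referring to a separate
string. The encoding lives in the type, a typedef gives it a short name,
and auto keeps the call sites quiet.

```cpp
#include <string>
#include <type_traits>

// Hypothetical encoding tags; they carry no data, only a type identity.
struct utf8_tag {};
struct latin1_tag {};

// Toy view: the encoding is part of the type, not a runtime property.
template <class Encoding>
struct view {
    typedef Encoding encoding_type;
    std::string bytes;
};

// Concatenation is only defined for two views of the same encoding, so
// view<utf8_tag> + view<latin1_tag> does not compile.
template <class Encoding>
view<Encoding> operator+(const view<Encoding>& a, const view<Encoding>& b) {
    return view<Encoding>{a.bytes + b.bytes};
}

typedef view<utf8_tag> utf8_string;

inline utf8_string full_name(const utf8_string& first, const utf8_string& last) {
    auto spaced = first + utf8_string{" "};  // no encoding spelled out here
    static_assert(
        std::is_same<decltype(spaced)::encoding_type, utf8_tag>::value,
        "the encoding is recoverable from the type");
    return spaced + last;
}
```

A transcoding function would be the one explicit place where a value
moves from one encoding type to another.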

> Do you really want all the people that now do:
>
> struct person
> {
>    std::string name;
>    std::string middle_name;
>    std::string family_name;
>    // .. etc.
> };
>
> to do this ?
>
> struct person
> {
>    boost::view<some_encoding_tag> name;
>    boost::view<some_encoding_tag> middle_name;
>    boost::view<some_encoding_tag> family_name;
>    // .. etc.
> };
>

Well:

typedef boost::strings::view<boost::strings::utf8_encoding> utf8_string;

struct person {
  utf8_string name, middle_name, family_name;
};

Where's the verbosity in that?

>
>>
>> ?
>>
>> See, there's absolutely 0 reason why you *have* to deal with a raw
>> sequence of bytes if what you really want is to deal with a view of
>> these bytes from the outset.
>>
>> Again I ask, am I missing something here?
>
> Please see the example above.
>

I did and I saw an even more succinct way of doing it. So again, I
don't see what I'm missing here.

> [snip/]
>>
>> Right, what I meant to say is that it hardly has any bearing when
>> we're talking about engineering solutions. So your circumstances and
>> mine may very well be different, but that doesn't change that we're
>> trying to solve the same problem. :)
>>
>
> If, along with solving your problem (all the completely valid points
> that you had about the performance), we also solve my and
> others' problem (completely valid points about the encoding)
> and we think about the acceptability and "adoptability",

I don't know what "acceptability" and "adoptability" mean in this context.

Both of these are a matter of taste and not of technical merit.

> we provide a backward compatible interface for people who
> do not have the time to re-implement all their string-related
> code at once and try really hard to get it into the standard,
> then I do not have a thing against it.

Backward compatibility with a broken implementation hardly seems like a
worthy goal. Deprecation is a better route IMHO.

Even if it does become std::string, it will be a deprecation of the
original definition. Deprecation *is* an option.

HTH

-- 
Dean Michael Berris
about.me/deanberris
