Subject: Re: [boost] [string] proposal
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2011-01-27 06:09:09


On Thu, Jan 27, 2011 at 5:32 PM, Matus Chochlik <chochlik_at_[hidden]> wrote:
> On Thu, Jan 27, 2011 at 4:49 AM, Dean Michael Berris
> <mikhailberis_at_[hidden]> wrote:
>> On Thu, Jan 27, 2011 at 2:28 AM, Matus Chochlik <chochlik_at_[hidden]> wrote:
>>
>> Being part of the type is "carrying".
>
> Yes, but a polymorphic type could also "carry" the encoding information;
> it wasn't clear to me what exactly you had in mind.
>

Sorry, I had no intention of implying that a template would carry
type information in any way other than by having that information be
part of the type.

> [snip/]
>>
>> I don't think I was questioning why UTF-8 specifically. I was
>> questioning why there had to be a "default is UTF-8" when really it's
>> just a sequence of bytes whether UTF-8, MPEG, Base64, MIME, etc.
>
> Last time I checked, JPEG, MPEG, Base64, ASN1, etc., etc., were not
> *text* encodings. And I believe that handling text is what the whole
> discussion is ultimately about.
>

But why do you need to separate text encoding from encoding in
general? Here's the logic:

You have a sequence of bytes (in a string).
You want to interpret that sequence of bytes in a given encoding (with a view).

Why does the encoding have to apply only to text?
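
To make that logic concrete, here is a minimal sketch, not an actual
proposal for the interface. The names byte_string, view, octet_encoding
and utf8_encoding are hypothetical and exist only for illustration. The
string is just bytes; the encoding lives in the type of the view that
interprets them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical encoding tags; neither comes from an existing library.
struct octet_encoding {};  // every byte is one unit
struct utf8_encoding {};   // bytes form UTF-8 code points

// The "string": a sequence of bytes with no encoding attached.
using byte_string = std::vector<std::uint8_t>;

// A view interprets the same bytes according to its encoding tag.
template <class Encoding> class view;

template <> class view<octet_encoding> {
public:
    explicit view(const byte_string& bytes) : bytes_(&bytes) {}
    // Length in octets.
    std::size_t length() const { return bytes_->size(); }
private:
    const byte_string* bytes_;  // non-owning; the bytes must outlive the view
};

template <> class view<utf8_encoding> {
public:
    explicit view(const byte_string& bytes) : bytes_(&bytes) {}
    // Length in code points: count every byte that is not a
    // continuation byte (10xxxxxx). Assumes well-formed UTF-8.
    std::size_t length() const {
        std::size_t n = 0;
        for (std::uint8_t b : *bytes_)
            if ((b & 0xC0) != 0x80) ++n;
        return n;
    }
private:
    const byte_string* bytes_;
};

// Usage: one sequence of bytes, two interpretations.
inline void demo() {
    const byte_string bytes = {'n', 'a', 0xC3, 0xAF, 'v', 'e'};  // "naive" with a diaeresis
    assert(view<octet_encoding>(bytes).length() == 6);
    assert(view<utf8_encoding>(bytes).length() == 5);
}
```

Nothing in byte_string knows about text; a view for Base64 or any
other encoding would slot in the same way as another specialization.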

> [snip/]
>>
>> But this already happens, it's called 7-bit clean byte encoding --
>> barring any endianness issues, just stuff whatever you already have in
>> a `char const *` into a socket. HTTP, FTP, and even memcached's
>> protocol work fine without the need to interpret strings other than a
>> sequence of bytes; my original opposition is having a string that by
>> default looked at data in it as UTF-8 when really a string would just
>> be a sequence of bytes not necessarily contiguous.
>
> Again, where you see a string primarily as a class for handling
> raw data that can be interpreted in hundreds of different ways,
> I see a string primarily as a class for encoding human-readable text.
>

So what's the difference between a string for encoding human readable
text and a string that handles raw data?

>>
>> Yes that's the intention. There's even an alternative (really F'n ugly
>> way) I suggested as well:
>>
>>  char evil_stack_buffer[256];
>>  linearize(string, evil_stack_buffer, 256);
>
> Of course it is an alternative, but there are also lots
> of functions in various APIs, the ShellExecute above
> being one of them, where you would need 4-10 such
> evil_stack_buffers, and the performance gain is not worth
> the ugliness and C-ness of the resulting code (for me).
> If I liked that kind of programming
> I would use C all the time and not only in places where
> absolutely necessary.
>

I didn't disagree with your original statement and I think both
interfaces -- the one that returns a pointer and the one that takes a
buffer with length as arguments -- have a place in the same world.
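
A rough sketch of how the two interfaces could sit side by side. The
chunked_string type is a hypothetical stand-in for a non-contiguous
string, and the second overload returns an owning std::string (whose
c_str() gives the pointer) rather than a raw pointer; both are
assumptions made for illustration, not the proposed design.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical non-contiguous string: the data lives in separate chunks.
struct chunked_string {
    std::vector<std::string> chunks;
};

// Interface 1: copy into a caller-supplied buffer of `size` bytes.
// The output is truncated if necessary and always NUL-terminated when
// size > 0. Returns the number of bytes written, excluding the NUL.
inline std::size_t linearize(const chunked_string& s, char* buffer, std::size_t size) {
    assert(buffer != nullptr || size == 0);
    if (size == 0) return 0;
    std::size_t written = 0;
    for (const std::string& chunk : s.chunks) {
        const std::size_t room = size - 1 - written;  // keep one byte for the NUL
        const std::size_t n = std::min(room, chunk.size());
        std::memcpy(buffer + written, chunk.data(), n);
        written += n;
        if (n < chunk.size()) break;  // buffer is full
    }
    buffer[written] = '\0';
    return written;
}

// Interface 2: return a contiguous, owning copy; call c_str() on the
// result to obtain a pointer for C APIs.
inline std::string linearize(const chunked_string& s) {
    std::string out;
    for (const std::string& chunk : s.chunks) out += chunk;
    return out;
}
```

The buffer version avoids an allocation but pushes size management
onto the caller; the returning version allocates but is harder to
misuse.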

>>
>> I know why UTF-* have their merits for encoding all the "characters"
>> for the languages of the world. However I don't see why it has to be
>> default for a string.
>
> Because the byte sequence is interpreted into *text*.

So?

> Let me try one more time: Imagine that someone
> proposed to you that he creates an ultra-fast-and-generic
> type for handling floating point numbers and there would
> be ~200 possible encodings for a float or double and
> the usage of the type would be
>
> uber_float x = get_x();
> uber_float y = get_y();
> uber_float z = view<acme_float_encoding_123_331_4342_Z>(x) +
> view<acme_float_encoding_123_331_4342_Z>(y);
> uber_float w = third::party::math::log(view<acme_float_encoding_452323_X>(z));
>
> would you choose it to calculate your z = x + y
> and w = log(z) in the 98% of the regular cases where
> you don't need to handle numbers on the helluva-big
> scale/range/precision? I would not.
>

So what's wrong with:

view<some_encoding_0> x = get_x();
view<some_encoding_1> y = get_y();
view<some_encoding_3> z = x+y;
float w = log(as<acme_float_encoding>(z));

?

See, there's absolutely 0 reason why you *have* to deal with a raw
sequence of bytes if what you really want is to deal with a view of
these bytes from the outset.
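
As an illustration only, here is one way typed views over raw float
bytes could look. The tags f32_big_endian and f32_little_endian, the
codec helper, and value() are all hypothetical names, and the sketch
assumes float is a 32-bit IEEE 754 type.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

using bytes4 = std::array<std::uint8_t, 4>;

// Hypothetical encoding tags for a 32-bit float stored as raw bytes.
struct f32_big_endian {};
struct f32_little_endian {};

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

inline std::uint32_t to_bits(float f) { std::uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }
inline float from_bits(std::uint32_t u) { float f; std::memcpy(&f, &u, sizeof f); return f; }

template <class Encoding> struct codec;

template <> struct codec<f32_big_endian> {
    static bytes4 encode(float f) {
        const std::uint32_t u = to_bits(f);
        return {{static_cast<std::uint8_t>(u >> 24), static_cast<std::uint8_t>(u >> 16),
                 static_cast<std::uint8_t>(u >> 8), static_cast<std::uint8_t>(u)}};
    }
    static float decode(const bytes4& b) {
        return from_bits((std::uint32_t(b[0]) << 24) | (std::uint32_t(b[1]) << 16) |
                         (std::uint32_t(b[2]) << 8) | std::uint32_t(b[3]));
    }
};

template <> struct codec<f32_little_endian> {
    static bytes4 encode(float f) {
        const std::uint32_t u = to_bits(f);
        return {{static_cast<std::uint8_t>(u), static_cast<std::uint8_t>(u >> 8),
                 static_cast<std::uint8_t>(u >> 16), static_cast<std::uint8_t>(u >> 24)}};
    }
    static float decode(const bytes4& b) {
        return from_bits(std::uint32_t(b[0]) | (std::uint32_t(b[1]) << 8) |
                         (std::uint32_t(b[2]) << 16) | (std::uint32_t(b[3]) << 24));
    }
};

// The view owns the bytes here for brevity; the encoding is part of its type.
template <class Encoding>
class view {
public:
    explicit view(float value) : bytes_(codec<Encoding>::encode(value)) {}
    // Converting from a view with another encoding re-encodes the value.
    template <class Other>
    view(const view<Other>& other) : view(other.value()) {}
    float value() const { return codec<Encoding>::decode(bytes_); }
    const bytes4& bytes() const { return bytes_; }
private:
    bytes4 bytes_;
};

// Arithmetic works across encodings; the result takes the left operand's encoding.
template <class A, class B>
view<A> operator+(const view<A>& a, const view<B>& b) {
    return view<A>(a.value() + b.value());
}

inline void demo() {
    const view<f32_big_endian> x(1.5f);
    const view<f32_little_endian> y(2.25f);
    const view<f32_little_endian> z = x + y;  // re-encoded on assignment
    assert(z.value() == 3.75f);
    assert(x.bytes()[0] == 0x3F && z.bytes()[3] == 0x40);
}
```

User code only ever names the encoding in the declared type; the sum
and the conversion between encodings need no explicit view<> wrapping
at each use.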

Again I ask, am I missing something here?

>>
>>> If this is not obvious; I live in a part of the world where ASCII
>>> is just not enough and, believe me, we are all tired here of juggling
>>> with language-specific code-pages :)
>>>
>>
>> Nope, it's not obvious, but it's largely a matter of circumstance
>> really I would say. ;)
>
> But most of the world actually lives under these
> circumstances. :)
>

Right, what I meant to say is that it hardly has any bearing when
we're talking about engineering solutions. So your circumstances and
mine may very well be different, but that doesn't change that we're
trying to solve the same problem. :)

-- 
Dean Michael Berris
about.me/deanberris
