
Subject: Re: [boost] [string] proposal
From: Matus Chochlik (chochlik_at_[hidden])
Date: 2011-01-27 04:32:43


On Thu, Jan 27, 2011 at 4:49 AM, Dean Michael Berris
<mikhailberis_at_[hidden]> wrote:
> On Thu, Jan 27, 2011 at 2:28 AM, Matus Chochlik <chochlik_at_[hidden]> wrote:
>> On Wed, Jan 26, 2011 at 6:26 PM, Dean Michael Berris
>> <mikhailberis_at_[hidden]> wrote:
>>> On Thu, Jan 27, 2011 at 12:43 AM, Matus Chochlik <chochlik_at_[hidden]> wrote:
>>>
>>> So really this wrapper is the 'view' that I talk about that carries
>>> with it an encoding and the underlying data. Right?
>>
>> Basically right, but generally I can imagine that
>> the encoding would not have to be 'carried' by the view
>> but just 'assumed'. But if by carrying you mean that it'll
>> have just some tag template argument without any (too much)
>> internal state, then: just Right.
>>
>
> Being part of the type is "carrying".

Yes, but a polymorphic type could also "carry" the encoding information,
so it wasn't clear to me what exactly you had in mind.
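To make sure we are talking about the same thing, here is a minimal
made-up sketch of what I mean by a tag template argument without any
internal state (all the names below are purely hypothetical):

struct utf8_encoding { };    // stateless tag types; they exist only
struct latin2_encoding { };  // to select the interpretation of the bytes

class text; // the immutable byte-sequence type we are discussing

template <typename Encoding>
class view
{
public:
    explicit view(const text& t) : ref(&t) { }
    // iteration, validation, conversion, ... expressed in terms
    // of Encoding at compile time
private:
    const text* ref; // only a reference to the data is stored; the
                     // encoding is "carried" purely by the type
};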

[snip/]
>
> I don't think I was questioning why UTF-8 specifically. I was
> questioning why there had to be a "default is UTF-8" when really it's
> just a sequence of bytes whether UTF-8, MPEG, Base64, MIME, etc.

Last time I checked, JPEG, MPEG, Base64, ASN1, etc., etc., were not
*text* encodings. And I believe that handling text is what the whole
discussion is ultimately about.

[snip/]
>
> But this already happens, it's called 7-bit clean byte encoding --
> barring any endianness issues, just stuff whatever you already have in
> a `char const *` into a socket. HTTP, FTP, and even memcached's
> protocol work fine without the need to interpret strings other than a
> sequence of bytes; my original opposition is having a string that by
> default looked at data in it as UTF-8 when really a string would just
> be a sequence of bytes not necessarily contiguous.

Again, where you see a string primarily as a class for handling
raw data that can be interpreted in hundreds of different ways,
I see a string primarily as a class for encoding human-readable text.

[snip/]
>>
>> if there is ...
>>
>> typedef something_beyond_my_current_level_of_comprehension native_encoding;
>> typedef ... native_wide_encoding;

On second thought, this should probably be a type templated
on the character type.
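Roughly something like this (again, hypothetical names only):

// primary template intentionally left undefined
template <typename Char>
struct native_encoding_for;

// the platform's narrow and wide encodings selected by character type
template <> struct native_encoding_for<char>    { /* ... */ };
template <> struct native_encoding_for<wchar_t> { /* ... */ };

typedef native_encoding_for<wchar_t> native_wide_encoding;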

>>
>> ... which works as I described above with your view ...
>>
>> text my_text = init();
>>
>> and I can do for example:
>> ShellExecuteW(
>>    ...,
>>    cstr(view<native_wide_encoding>(my_text)),
>>    ...
>> );
>>
>>
>> ... (and we can have some shorthand for c_str(view<native_wide_encoding>(x))),
>> then, basically, Right.
>>
>
> Yes that's the intention. There's even an alternative (really F'n ugly
> way) I suggested as well:
>
>  char evil_stack_buffer[256];
>  linearize(string, evil_stack_buffer, 256);

Of course it is an alternative, but there are also lots
of functions in various APIs (the ShellExecute above
being one of them) where you would need 4-10 such
evil_stack_buffers, and the performance gain is not
worth the ugliness and C-ness of the code (for me).
If I liked that kind of programming I would use C all
the time and not only in places where it is absolutely
necessary.
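Just to illustrate the difference in verbosity, compare calling an API
that takes several wide C-strings with each approach (only a sketch,
reusing the hypothetical names from above):

// some legacy API taking several wide C-strings (e.g. ShellExecuteW)
void legacy_api(const wchar_t* op, const wchar_t* file,
                const wchar_t* params, const wchar_t* dir);

void with_views(const text& op, const text& file,
                const text& params, const text& dir)
{
    // one expression per argument, no named temporaries
    legacy_api(cstr(view<native_wide_encoding>(op)),
               cstr(view<native_wide_encoding>(file)),
               cstr(view<native_wide_encoding>(params)),
               cstr(view<native_wide_encoding>(dir)));
}

void with_linearize(const text& op, const text& file,
                    const text& params, const text& dir)
{
    // a separate, hopefully-large-enough buffer for each argument
    // (assuming a wide-character-aware linearize() overload exists)
    wchar_t op_buf[256], file_buf[1024], params_buf[1024], dir_buf[1024];
    linearize(op,     op_buf,     256);
    linearize(file,   file_buf,   1024);
    linearize(params, params_buf, 1024);
    linearize(dir,    dir_buf,    1024);
    legacy_api(op_buf, file_buf, params_buf, dir_buf);
}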

>
> Which means you can let the user of the interface define where the
> linearized version of the immutable string would be placed.
>
[snip/]
>
> I think I was asking why make a string to default encode in UTF-8 when
> UTF-8 was really just a means of interpreting a sequence of bytes. Why
> a string would have to do that by default is what I don't understand
> -- and which is why I see it as a decoupling of a view and an
> underlying string.
>
> I know why UTF-* have their merits for encoding all the "characters"
> for the languages of the world. However I don't see why it has to be
> default for a string.

Because the byte sequence is interpreted as *text*.
Let me try one more time: imagine that someone
proposed an ultra-fast-and-generic type for handling
floating-point numbers, where there would be ~200
possible encodings for a float or double, and the usage
of the type would be

uber_float x = get_x();
uber_float y = get_y();
uber_float z = view<acme_float_encoding_123_331_4342_Z>(x) +
               view<acme_float_encoding_123_331_4342_Z>(y);
uber_float w = third::party::math::log(view<acme_float_encoding_452323_X>(z));

would you choose it to calculate your z = x + y
and w = log(z) in the 98% of regular cases where
you don't need to handle numbers on a helluva-big
scale/range/precision? I would not.
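Transposed back to strings, the same contrast looks like this
(hypothetical names again): with a sensible default encoding the
common case stays simple, while without one every use has to spell
out the encoding:

// with a default encoding (e.g. UTF-8) the common case is plain:
text greeting = get_greeting();
text name = get_name();
text message = concat(greeting, name);

// without a default, every operation names the encoding explicitly:
text message2 = concat(view<utf8_encoding>(greeting),
                       view<utf8_encoding>(name));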

>
>> If this is not obvious; I live in a part of the world where ASCII
>> is just not enough and, believe me, we are all tired here of juggling
>> with language-specific code-pages :)
>>
>
> Nope, it's not obvious, but it's largely a matter of circumstance
> really I would say. ;)

But most of the world actually lives under these
circumstances. :)

>
[snip/]

BR,

Matus

