Subject: Re: [boost] [string] proposal
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2011-01-26 22:49:53


On Thu, Jan 27, 2011 at 2:28 AM, Matus Chochlik <chochlik_at_[hidden]> wrote:
> On Wed, Jan 26, 2011 at 6:26 PM, Dean Michael Berris
> <mikhailberis_at_[hidden]> wrote:
>> On Thu, Jan 27, 2011 at 12:43 AM, Matus Chochlik <chochlik_at_[hidden]> wrote:
>>
>> So really this wrapper is the 'view' that I talk about that carries
>> with it an encoding and the underlying data. Right?
>
> Basically right, but generally I can imagine that
> the encoding would not have to be 'carried' by the view
> but just 'assumed'. But if by carrying you mean that it'll
> have just some tag template argument without any (or too much)
> internal state, then: just Right.
>

Being part of the type is "carrying".
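
For concreteness, here's a minimal sketch of the kind of view I have
in mind (all names hypothetical; I'm using std::string as a stand-in
for the proposed immutable string):

  #include <string>

  typedef std::string istring; // stand-in for the immutable string

  struct utf8_encoding {}; // empty tag type: no runtime state

  template <typename Encoding>
  class view {
  public:
      explicit view(istring const& s) : backing_(s) {}
      istring const& raw() const { return backing_; }
  private:
      istring const& backing_; // the underlying byte sequence
  };

The encoding lives entirely in the type, so a view is nothing more
than a typed reference to the underlying bytes.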

>>
>> I don't see the value in this though requiring that it be part of the
>> 'text'. I could easily write something like:
>
> The value of choosing Unicode is that, when processing text,
> you don't have to worry whether you picked the right code table
> to interpret the text into characters. Another "win" is that on
> many platforms UTF-8 is already used as the default encoding
> (where things are not completely encoding-independent), so you
> often do not have to do any transcoding to view the data as
> UTF-8, because it already is UTF-8.
>

I don't think I was questioning why UTF-8 specifically. I was
questioning why there had to be a "default is UTF-8" when really
it's just a sequence of bytes, whether that's UTF-8, MPEG, Base64,
MIME, etc.

> I'm not a machine, so when I see written text I see characters,
> not byte sequences and code tables. I would like to be able to
> handle text in my programs at the level of code points, if not
> logical characters, and let the computer handle the encoding,
> validation, etc., just as I don't have to specify how to encode
> a float or a double.
>
> Another matter is that if you send your text, by whatever means
> you choose, to someone at the other end of the world, then
> he will see the same characters without any need for
> transcoding, locale-picking, or other mumbo jumbo.
>

But this already happens; it's called 7-bit clean byte encoding --
barring any endianness issues, you just stuff whatever you already
have in a `char const *` into a socket. HTTP, FTP, and even
memcached's protocol work fine without interpreting strings as
anything other than a sequence of bytes. My original objection is
to having a string that by default treats the data in it as UTF-8,
when really a string is just a sequence of bytes, not necessarily
contiguous.
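
As a minimal sketch of that point (assuming an already-connected
POSIX socket descriptor, nothing more):

  #include <sys/socket.h>
  #include <cstring>

  // Whatever bytes sit behind 'data' go on the wire as-is; the
  // transport doesn't care what encoding (if any) they are in.
  void send_raw(int sock, char const* data)
  {
      send(sock, data, std::strlen(data), 0);
  }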

[snip parts where we already agree]
>>
>> I don't see why the default and the other encoding case are really
>> that different from an interface perspective. The underlying string
>> will still be a series of bytes in memory, and encoding is just a
>> matter of viewing it a given way. Right?
>
> if there is ...
>
> typedef something_beyond_my_current_level_of_comprehension native_encoding;
> typedef ... native_wide_encoding;
>
> ... which works as I described above with your view ...
>
> text my_text = init();
>
> and I can do for example:
> ShellExecuteW(
>    ...,
>    cstr(view<native_wide_encoding>(my_text)),
>    ...
> );
>
>
> ... (and we can have some shorthand for c_str(view<native_wide_encoding>(x))),
> then, basically, Right.
>

Yes, that's the intention. There's even an alternative (really F'n
ugly) way I suggested as well:

  char evil_stack_buffer[256];
  linearize(string, evil_stack_buffer, 256);

This means you can let the user of the interface decide where the
linearized version of the immutable string gets placed.
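
One plausible (purely hypothetical) shape for that linearize,
copying at most n bytes and reporting how many were written -- again
with std::string standing in for the immutable string:

  #include <algorithm>
  #include <cstddef>
  #include <string>

  typedef std::string istring; // stand-in for the immutable string

  // Copy at most 'n' bytes of 's' into caller-provided storage and
  // report how many bytes were actually written.
  std::size_t linearize(istring const& s, char* buffer, std::size_t n)
  {
      std::size_t const len = std::min(n, s.size());
      std::copy(s.data(), s.data() + len, buffer);
      return len;
  }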

>>
>> So if `typedef view<Encoding> utf8` were there, how far would that
>> be from the default encoding case? And why does it have to be
>> UTF specifically, for that matter?
>
> Because it is an already accepted and widely used text-encoding
> standard, capable of representing (basically) every writing system
> known to mankind, even some of the invented ones (Quenya, Klingon, ...)
> *at once* (in the same text), without any encoding-switching
> mumbo-jumbo being used :)

I think I was asking why a string should encode as UTF-8 by default
when UTF-8 is really just one means of interpreting a sequence of
bytes. Why a string would have to do that by default is what I
don't understand -- which is why I see it as a decoupling of a view
from an underlying string.

I know why the UTF-* encodings have their merits for representing
the "characters" of all the languages of the world. However, I
don't see why one of them has to be the default for a string.

> If this is not obvious; I live in a part of the world where ASCII
> is just not enough and, believe me, we are all tired here of juggling
> with language-specific code-pages :)
>

Nope, it's not obvious, but I would say it's largely a matter of
circumstance. ;)

>>
>> Why is it inefficient when there's no need for an actual copy to be involved?
>
> No unnecessary copying => /me not complaining :)
>

Cool. :)

>>
>>
>> typedef view<utf8_encoding> utf8;
>>
>> I don't see why that shouldn't work for your requirements. :)
>
> Make that
>
> typedef view<utf8_encoding> text;
>
> and I will be completely happy and very grateful. :-)
> we can always work out the little things in the process.
>

Agreed. Now I'm prepared to move on and solidify the interface to the
immutable string and the views.
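
Building on the view sketch above, the agreed-upon shape would be
roughly this (all names still provisional):

  typedef view<utf8_encoding> text; // the UTF-8 view is the 'text' type

  // Shorthand for a linearized, NUL-terminated rendering of a view:
  template <typename Encoding>
  char const* c_str(view<Encoding> const& v);

  // Usage, per the ShellExecuteW example above:
  //   text my_text = init();
  //   ShellExecuteW(..., c_str(view<native_wide_encoding>(my_text)), ...);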

Expect some (long, potentially tiring, but hopefully more coherent)
output in terms of an interface proposal in a few hours. :)

-- 
Dean Michael Berris
about.me/deanberris
