Subject: Re: [boost] [string] proposal
From: Matus Chochlik (chochlik_at_[hidden])
Date: 2011-01-28 16:57:59


On Fri, Jan 28, 2011 at 10:31 PM, Dean Michael Berris
<mikhailberis_at_[hidden]> wrote:
> On Sat, Jan 29, 2011 at 5:13 AM, Matus Chochlik <chochlik_at_[hidden]> wrote:
>> On Fri, Jan 28, 2011 at 9:46 PM, Dean Michael Berris
>>>>
>>>>   This whole discussion started because we need UTF-8
>>>>   in strings, and now we are back to the beginning?
>>>>
>>>
>>> No, the discussion started because we need a UTF-8 view of data. You
>>> missed the point I was making. And you didn't understand the document
>>> I wrote.
>>
>> Sorry, but no. The discussion started with the proposal that we should
>> by default treat std::strings as if they were UTF-8 encoded.
>> Artyom should know because he was the one who made the original
>> proposal. The whole 'view' idea was brought up only much later.
>>
>
> And the point I was making was that doing precisely this is the
> "wrong" way of doing it. Assuming a default encoding is "unnecessary",
> as an encoding is ultimately a matter of how the data is
> interpreted.
>
> I was attempting to solve the problem that is std::string. In the
> process I'm moving the issue away from the underlying data and making
> it a matter of interpretation. The way that makes sense to me is to
> move the encoding into a view of the data that is held in a string.
> The string would be the data structure, the view an interpretation
> of it.
>
> I never precluded the string from holding UTF-8 encoded data, but
> saying that this is the default achieves nothing and is ultimately
> unnecessary. The point of the design I've been proposing is that
> interpreting data in a given encoding is separate from how the
> data is actually stored. Now let's say you have a UTF-8 string
> builder: what else would it write in memory aside from UTF-8 encoded
> data? It will still yield a string, though, which could be interpreted
> in many different ways -- I just don't see the encoding as something
> intrinsic to the string. That means a string can hold UTF-8 encoded
> data and I can wrap that in a view for UTF-16 and see that it will not
> validate correctly -- unless I wrap the string with a view for UTF-8
> first and then pass that into a view for UTF-16, so that transcoding
> can happen on the fly.
>
> Writing algorithms that deal with strings is different from writing
> algorithms that deal with encoded text. Those are two different levels.
>
> Explaining the whole point of the matter over and over again makes
> me sound like a broken record. If you still don't get what I'm
> saying, then I guess I'm going to have to try a different route and
> just show what I mean in terms of code at some point.

Dean, believe me, I got what you said the first time you said
it, like 200 posts ago. I know that the string data is ultimately
stored in memory as a sequence of bytes. But then you
proposed to solve my problem by suggesting the view<Encoding>
template. Then, like 50 posts ago, we finally agreed on typedef-ing
it and naming it 'text', since using something called
view<encoding_tag> is not acceptable to me.

Now, if this

typedef view<utf8_encoding_tag> text;

is the only line of code where I see the encoding, and
I'll be able to do all the text handling, i.e. searching
for code points/characters (not only bytes), searching for
words, concatenation, splitting, writing it to a file, socket,
etc. and reading it from a file, socket, etc., using it
with some c_str-like adapter with C APIs, etc., basically
doing (nearly) everything that I was able to do with std::string
*without* ever mentioning the encoding again, then you already
have me convinced. If I cannot do those things without specifying
the encoding (unless necessary), then this is useless to me
for text handling.
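
To make that requirement concrete, here is a rough sketch of what I
have in mind. It is not Dean's design and not an existing Boost
interface: only the names view, utf8_encoding_tag and text come from
this thread. Every member (bytes, c_str, valid, code_points, length,
find, operator+) is a name I made up for illustration, and the decoder
is a minimal one written just for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Tag type named in the thread; its definition here is an assumption.
struct utf8_encoding_tag {};

// Primary template: a view interprets stored bytes under some Encoding.
template <typename Encoding>
class view;

// Hypothetical UTF-8 specialization; every member name is illustrative.
template <>
class view<utf8_encoding_tag> {
public:
    explicit view(std::string bytes) : bytes_(std::move(bytes)) {}

    // Raw storage and a C-API adapter; neither call mentions the encoding.
    const std::string& bytes() const { return bytes_; }
    const char* c_str() const { return bytes_.c_str(); }

    // True if the whole byte sequence decodes as well-formed UTF-8.
    bool valid() const {
        std::size_t i = 0;
        char32_t cp = 0;
        while (i < bytes_.size())
            if (!decode(i, cp)) return false;
        return true;
    }

    // Decoded code points, stopping at the first malformed sequence.
    std::vector<char32_t> code_points() const {
        std::vector<char32_t> out;
        std::size_t i = 0;
        char32_t cp = 0;
        while (i < bytes_.size() && decode(i, cp)) out.push_back(cp);
        return out;
    }

    // Length in code points, not bytes.
    std::size_t length() const { return code_points().size(); }

    // Index (in code points) of the first occurrence of cp, or npos.
    std::size_t find(char32_t cp) const {
        const std::vector<char32_t> cps = code_points();
        for (std::size_t k = 0; k < cps.size(); ++k)
            if (cps[k] == cp) return k;
        return std::string::npos;
    }

    // Concatenating two UTF-8 texts is plain byte concatenation.
    friend view operator+(const view& a, const view& b) {
        return view(a.bytes_ + b.bytes_);
    }

private:
    // Decodes one code point at offset i and advances i past it.
    bool decode(std::size_t& i, char32_t& cp) const {
        assert(i < bytes_.size());
        static const char32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
        const unsigned char b0 = static_cast<unsigned char>(bytes_[i]);
        std::size_t len = 0;
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else return false;  // stray continuation or invalid lead byte
        if (i + len > bytes_.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes_[i + k]);
            if ((b & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF.
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
        return true;
    }

    std::string bytes_;
};

// The only line where user code would name the encoding.
typedef view<utf8_encoding_tag> text;
```

With something like this, code such as text t("..."); t.find(0xEF);
t.length(); t + text("!"); or passing t.c_str() to a C API never has
to name the encoding after the typedef. The underlying std::string
stays available through bytes() for anyone who wants to interpret the
same data some other way.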

Peace, Love, Best regards,

Matus

