Subject: Re: [boost] [string] proposal
From: Matus Chochlik (chochlik_at_[hidden])
Date: 2011-01-26 13:28:45


On Wed, Jan 26, 2011 at 6:26 PM, Dean Michael Berris
<mikhailberis_at_[hidden]> wrote:
> On Thu, Jan 27, 2011 at 12:43 AM, Matus Chochlik <chochlik_at_[hidden]> wrote:
>
> So really this wrapper is the 'view' that I talk about that carries
> with it an encoding and the underlying data. Right?

Basically right, but generally I can imagine that
the encoding would not have to be 'carried' by the view
but just 'assumed'. But if by carrying you mean that it'll
have just some tag template argument without any (too much)
internal state, then: just Right.
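
To make the "tag template argument" idea concrete, here is a minimal
sketch. All names in it (utf8_encoding, latin1_encoding, bytes,
size_in_bytes, and this particular shape of view) are placeholders
assumed for illustration, not an existing Boost interface. The point is
only that the encoding lives in the type, so the view stores nothing
beyond a pointer and a length.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical encoding tags: empty types, so they carry no run-time state.
struct utf8_encoding {};
struct latin1_encoding {};

// Hypothetical non-owning view. The encoding is "assumed" through the
// template argument; only a pointer and a byte count are stored.
template <typename Encoding>
class view {
public:
    typedef Encoding encoding_type;

    // View over an existing std::string (the string must outlive the view).
    explicit view(const std::string& s) : data_(s.data()), size_(s.size()) {}

    // View over a NUL-terminated literal or buffer.
    explicit view(const char* s) : data_(s), size_(std::strlen(s)) {}

    const char* bytes() const { return data_; }
    std::size_t size_in_bytes() const { return size_; }

private:
    const char* data_;
    std::size_t size_;
};

// The default that this thread is asking for.
typedef view<utf8_encoding> text;
```

With this, text t("The quick brown fox") and view<latin1_encoding> over
the same bytes are distinct types, so mixing them up is a compile-time
matter rather than a run-time one.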

>
> I don't see the value in this though requiring that it be part of the
> 'text'. I could easily write something like:

The value of choosing Unicode is that, when processing text, you don't
have to worry about whether you picked the right code table to interpret
the bytes as characters. Another "win" is that many platforms already
use UTF-8 as the default encoding wherever things are not completely
encoding-independent, so you do not have to do any transcoding to view
the data as UTF-8: it already is UTF-8.

I'm not a machine, so when I see written text I see characters,
not byte sequences and code tables. I would like to be able to
handle text in my programs at the level of code points, if not
logical characters too, and let the computer handle the
encoding validation, etc., just as I don't have to specify
how to encode a float or double.
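
As a small illustration of working at the level of code points rather
than bytes, the sketch below counts code points in a UTF-8 string. The
function name count_code_points is made up for this example, and it
assumes the input is already well-formed UTF-8: every code point starts
with exactly one byte that is not a continuation byte (10xxxxxx), so
counting those bytes is enough.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of code points in well-formed UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx; every other byte
// starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        if ((b & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Note that this counts code points, not "logical characters": a base
letter followed by a combining accent is still two code points.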

Another matter is that if you send your text, by whatever means
you choose, to someone at the other end of the world, then
they will see the same characters without any need for
transcoding, locale-picking, or other mumbo jumbo.

>
>  typedef view<utf8_encoded> utf8;
>
> And have something like this be possible:
>
>  utf8 u("The quick brown fox jumps over the lazy dog.");
>
> Now, that's your default utf8-encoded view of the underlying string.
>
> Right?

Right.

>
>> Every time when I do not specify an encoding it is assumed
>> by default to be UTF-8 i.e. when I'm reading text from
>> a TCP connection or from a file I expect that it already is
>> UTF-8 encoded and would like the string (optionally or always)
>> to validate it for me.
>>
>
> Hmmm... So then it's just a matter of using a type similar to what I
> pointed out above as the default then?

Yes.
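
For the "validate it for me" part, a minimal sketch of what such a check
could look like follows. The name is_valid_utf8 is hypothetical and this
is not a proposed interface, just one way to write the check: it rejects
bad lead bytes, truncated sequences, overlong forms, surrogates, and
values above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if s is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (b < 0x80) { ++i; continue; }        // plain ASCII byte

        std::size_t len;      // total bytes in this sequence
        unsigned long cp;     // code point being assembled
        unsigned long min;    // smallest value allowed for this length
        if ((b & 0xE0) == 0xC0)      { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;                      // stray continuation or bad lead

        if (n - i < len) return false;          // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                     // overlong encoding
        if (cp > 0x10FFFF) return false;                // outside Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
        i += len;
    }
    return true;
}
```

A string type could run a check like this once, when the bytes arrive
from the file or the TCP connection, and then treat the data as known
UTF-8 from there on.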

>
> I don't see why the default and the other encoding case are really
> that different from an interface perspective. The underlying string
> will still be a series of bytes in memory, and encoding is just a
> matter of viewing it a given way. Right?

If there is ...

typedef something_beyond_my_current_level_of_comprehension native_encoding;
typedef ... native_wide_encoding;

... which works as I described above with your view ...

text my_text = init();

and I can do for example:
ShellExecuteW(
    ...,
    c_str(view<native_wide_encoding>(my_text)),
    ...
);

... (and we can have some shorthand for c_str(view<native_wide_encoding>(x))),
then, basically, Right.
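
Behind a view<native_wide_encoding> on Windows there would have to be
a UTF-8 to UTF-16 conversion somewhere, since the W functions take
UTF-16. A rough sketch of that step, under the assumption that the text
is stored as UTF-8: utf8_to_utf16 is a made-up name, it needs C++0x for
std::u16string, and it does only minimal checking (it does not reject
overlong forms).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: transcode UTF-8 bytes into UTF-16 code units.
// Throws std::invalid_argument on malformed or truncated input.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;    // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else throw std::invalid_argument("bad UTF-8 lead byte");

        if (in.size() - i - 1 < extra)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("bad UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));   // one code unit
        } else {
            cp -= 0x10000;                              // surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The shorthand mentioned above would hide this behind the view, and on a
platform where the native wide encoding already matches the stored one
it could skip the conversion entirely.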

>
> So what if `typedef view<Encoding> utf8` was there how far would that
> be from the default encoding case? And why does it have to be
> especially UTF for that matter?

Because it is an already accepted and widely used text-encoding
standard, capable of representing (basically) every writing system
known to mankind, even some of the invented ones (Quenya, Klingon, ...),
*at once* (in the same text) without any encoding-switching mumbo jumbo
being used :)
In case this is not obvious: I live in a part of the world where ASCII
is just not enough and, believe me, we are all tired here of juggling
language-specific code pages :)

>
>>>
[snip/]
>>
>> Generally speaking, the syntax is not that important to me;
>> I can get used to almost anything :) so c_str(my_str) is
>> OK with me, if it does not involve just copying the string,
>> whatever the internal representation is. As Robert said,
>> if the internal string data already is contiguous then
>> this should be a no-op.
>>
>> boost::string s = get_huge_string();
>> s = s ^ get_another_huge_string();
>> s = s ^ get_yet_another_huge_string();
>> std::string(s).c_str();
>>
>> is too inefficient for my taste.
>
> Why is it inefficient when there's no need for an actual copy to be involved?

No unnecessary copying => /me not complaining :)

>
>
> typedef view<utf8_encoding> utf8;
>
> I don't see why that shouldn't work for your requirements. :)

Make that

typedef view<utf8_encoding> text;

and I will be completely happy and very grateful. :-)
We can always work out the little things in the process.

Matus

