Subject: Re: [boost] [string] proposal
From: Dean Michael Berris (mikhailberis_at_[hidden])
Date: 2011-01-26 12:26:43


On Thu, Jan 27, 2011 at 12:43 AM, Matus Chochlik <chochlik_at_[hidden]> wrote:
> On Wed, Jan 26, 2011 at 5:06 PM, Dean Michael Berris
> <mikhailberis_at_[hidden]> wrote:
>> On Wed, Jan 26, 2011 at 11:19 PM, Matus Chochlik <chochlik_at_[hidden]> wrote:
> [snip/]
>>
>> Right, but others seem to want to know about the implementation
>> details to try and work out whether the overall interface being
>> designed is actually going to be a viable implementation. So while I
>> say "value semantics" others have asked how that would be implemented
>> and -- being the gratuitous typer that I am ;) -- I would respond. :D
>
> OK :)

:D

>>
>> So what would be the point of implementing a string "wrapper" that
>> knew its encoding as part of the type if you didn't want to know the
>> encoding in most of the cases? I think I'm missing the logic there.
>
> The logic would be that you no longer would have to
> be concerned if A' (A with acute), etc, is encoded as ISO-8859-2 says,
> or as UTF-8 says, etc. But you would *always* handle
> the string as a sequence of *Unicode* code-points or even
> "logical characters" and not as a sequence of bytes that are
> being somehow encoded (generally).
> I can imagine use-cases where it still would be OK to
> get the underlying byte-sequence (read-only) for things
> that are encoding-independent.
>

So really this wrapper is the 'view' that I talk about that carries
with it an encoding and the underlying data. Right?

>>
>> So we're obviously talking about two different strings here -- your
>> "text" that knows the encoding and the immutable string that you may
>> or may not build upon. How then do you design the algorithms if you
>> *didn't* want to explicitly specify the encoding you want the
>> algorithms to use?
>
> By saying that the *implicit* encoding is UTF-8 and that should
> I need to use another encoding I will treat it as a special case.

I don't see the value in requiring that it be part of the 'text',
though. I could easily write something like:

  typedef view<utf8_encoded> utf8;

And have something like this be possible:

  utf8 u("The quick brown fox jumps over the lazy dog.");

Now, that's your default utf8-encoded view of the underlying string.

Right?
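
To make that concrete, here is a minimal sketch of such a view. It assumes a hypothetical non-owning `view` template that is parameterized on an encoding tag and merely borrows the bytes it is given. The names `view`, `utf8_encoded`, `size_in_bytes` and `length` are made up for illustration and are not an existing Boost interface, and the code-point count assumes well-formed UTF-8 with no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical encoding tag: carries no data, only selects behaviour.
struct utf8_encoded {};

// Hypothetical non-owning view: a pointer/length pair plus an encoding tag.
// The viewed bytes must outlive the view.
template <class Encoding>
class view {
public:
    view(const char* data) : data_(data), size_(std::strlen(data)) {}
    const char* data() const { return data_; }
    std::size_t size_in_bytes() const { return size_; }
private:
    const char* data_;
    std::size_t size_;
};

typedef view<utf8_encoded> utf8;

// Code points in a UTF-8 view: count every byte that is not a
// continuation byte (bit pattern 10xxxxxx). Assumes well-formed input.
inline std::size_t length(const utf8& v) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < v.size_in_bytes(); ++i) {
        unsigned char c = static_cast<unsigned char>(v.data()[i]);
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Usage: the view interprets the bytes, it does not copy them.
inline void example() {
    utf8 u("The quick brown fox jumps over the lazy dog.");
    assert(length(u) == 44);            // ASCII: one byte per code point
    utf8 a("\xC3\x81");                 // U+00C1, A with acute, in UTF-8
    assert(a.size_in_bytes() == 2);
    assert(length(a) == 1);
}
```

The encoding lives entirely in the type of the view, so the underlying string needs no knowledge of it.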

> Every time when I do not specify an encoding it is assumed
> by default to be UTF-8 i.e. when I'm reading text from
> a TCP connection or from a file I expect that it already is
> UTF-8 encoded and would like the string (optionally or always)
> to validate it for me.
>

Hmmm... So then it's just a matter of using a type similar to what I
pointed out above as the default then?

> Then there are two cases:
> a) Default encoding of std::string depending upon std::locale
> and encoding of std::wstring which is for example on Windows
> by default treated as being encoded with UTF-16 and on Linux
> as being encoded as UTF-32.
> For these I would love to have some simple means of saying
> to 'boost::text' give me your representation in the encoding
> that std::string is expected to be encoded in or "build" yourself
> from the native encoding, that std::string is supposed to be using.
> + the same for wstring.
>
> b) Every other encoding. For example if I really needed
> to convert my string to IBM CP850 because I want
> to send it to an old printer then only in this case should
> I be required (obviously) to specify the encoding explicitly.
>

I don't see why the default and the other encoding case are really
that different from an interface perspective. The underlying string
will still be a series of bytes in memory, and encoding is just a
matter of viewing it a given way. Right?

>>
>> In one of the previous messages I laid out an algorithm template like so:
>>
>>  template <class String>
>>  void foo(String s) {
>>    view<encoding> encoded(s);
>>    // deal with encoded from here on out
>>  }
>>
>> Of course then from foo's user perspective, she wouldn't have to do
>> anything with her string to be passed in. From the algorithm
>> implementer perspective you would know exactly what encoding was
>> wanted and how to go about implementing the algorithm even potentially
>> having something like this as well:
>>
>>  template <class Encoding>
>>  void foo(view<Encoding> encoded) {
>>    // deal with the encoded string appropriately here
>>  }
>>
>> And you get the benefits in either case of being able to either
>> explicitly or implicitly deal with strings depending on whether they
>> have been explicitly encoded already or whether it's just a raw set of
>> bytes.
>
> I see that this is OK for many use cases. But having a single
> pre-defined, default encoding also has its advantages, because
> usually you can skip the whole view<Encoding> part.
>

So what if `typedef view<utf8_encoded> utf8` was there? How far would
that be from the default encoding case? And why does it have to be
UTF specifically, for that matter?

>>
>> So, if there was a way to "encode" (there's that word again) the data
>> in an immutable string into an acceptably-rendered `char const *`
>> would that solve the problem? The whole point of my assertion (and
>> Dave's question) is whether c_str() would have to be intrinsic to the
>> string, which I have pointed out in a different message (not too long
>> ago) that it could very well be an external algorithm.
>
> generally speaking the syntax is not that important for me
> I can get used to almost everything :) so c_str(my_str) is
> OK with me, if it does not involve just copying the string
> whatever the internal representation is. As Robert said
> if the internal string data already is contiguous then
> this should be a no-op.
>
> boost::string s = get_huge_string();
> s = s ^ get_another_huge_string();
> s = s ^ get_yet_another_huge_string();
> std::string(s).c_str()
>
> is too inefficient for my taste.
>

Why is it inefficient when there's no need for an actual copy to be involved?

s ^ get_huge_string()

would basically yield a lazily composed concatenation which could just
hold references to the original strings (again, with potential for
optimizations depending on the length of the strings, etc.).

So then you can layer that up and just need to linearize it when it's
actually required -- in the conversion for the std::string case. And
if you really wanted to just linearize the string into a void * buffer
somewhere then that should be perfectly fine as well.

I guess assuming that you have actual temporaries built (like how
std::string would have you believe) when concatenating strings will
make it look like it's really inefficient, but there should be a way
of making it more efficient *because* the string is immutable.
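
Here is a minimal sketch of how that could work, assuming C++11's `std::shared_ptr` and a hypothetical immutable string type (called `istring` here, with a made-up `str()` for linearization). It is one possible rope-like layout, not a proposed implementation: concatenation builds a node that shares both operands, and character data is copied only when a contiguous buffer is requested.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical immutable string: either a leaf holding bytes, or a
// concatenation node referring to two existing strings.
class istring {
    struct node {
        explicit node(std::string s) : leaf(std::move(s)), size(leaf.size()) {}
        node(std::shared_ptr<const node> l, std::shared_ptr<const node> r)
            : left(std::move(l)), right(std::move(r)),
              size(left->size + right->size) {}
        std::string leaf;                         // used by leaf nodes only
        std::shared_ptr<const node> left, right;  // used by concat nodes only
        std::size_t size;                         // total length in bytes
    };

    explicit istring(std::shared_ptr<const node> n) : node_(std::move(n)) {}

    // Depth-first walk that copies each leaf into the output buffer.
    // Recursive, so very deep concatenation chains would need an
    // iterative version.
    static void append(const node& n, std::string& out) {
        if (n.left) { append(*n.left, out); append(*n.right, out); }
        else out += n.leaf;
    }

    std::shared_ptr<const node> node_;

public:
    istring(const char* s) : node_(std::make_shared<node>(std::string(s))) {}

    // Lazy concatenation: shares both operands, copies no character data.
    friend istring operator^(const istring& l, const istring& r) {
        return istring(std::make_shared<node>(l.node_, r.node_));
    }

    std::size_t size() const { return node_->size; }

    // Linearize on demand: the only place where characters are copied.
    std::string str() const {
        std::string out;
        out.reserve(node_->size);
        append(*node_, out);
        return out;
    }
};

// Usage: two concatenations, one copy at the very end.
inline void example() {
    istring a("huge"), b(" string");
    istring abc = a ^ b ^ istring("!");
    assert(abc.size() == 12);
    assert(abc.str() == "huge string!");
}
```

Sharing nodes like this is only safe because no string can be modified after construction, which is exactly what immutability buys here.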

>>
>> Right. This is Boost anyway, and I've always viewed libraries that get
>> proposed to and accepted into Boost as the kinds of libraries that are
>> developed to eventually be made part of the C++ standard library.
>>
>> So while out of the gate the string implementation can very well be
>> not called std::string, I don't see why the current std::string can't
>> be deprecated later on (look at std::auto_ptr) and a different
>> implementation be put in its place? :D Of course that may very well be
>> C++21xx so I don't think I need to worry about it having to be a
>> std::string killer at the outset. ;)
>
> If you pull this off (replacing std::string without having a transition period
> with a backward compatible interface) then you will be my personal hero. :-)

Well don't hold your breath for that because, well, you won't have
'erase' and other things that std::string supports, so it won't be
backward compatible to std::string. :)

> Wait .. provided, that the encoding-related stuff I said above
> will be part of the string :) or there will be some wrapper around it
> providing that functionality.
>

typedef view<utf8_encoded> utf8;

I don't see why that shouldn't work for your requirements. :)

-- 
Dean Michael Berris
about.me/deanberris
