Subject: Re: [boost] [string] proposal
From: Matus Chochlik (chochlik_at_[hidden])
Date: 2011-01-26 11:43:33


On Wed, Jan 26, 2011 at 5:06 PM, Dean Michael Berris
<mikhailberis_at_[hidden]> wrote:
> On Wed, Jan 26, 2011 at 11:19 PM, Matus Chochlik <chochlik_at_[hidden]> wrote:
[snip/]
>
> Right, but others seem to want to know about the implementation
> details to try and work out whether the overall interface being
> designed is actually going to be a viable implementation. So while I
> say "value semantics" others have asked how that would be implemented
> and -- being the gratuitous typer that I am ;) -- I would respond. :D

OK :)
>
>>>
>>> I still don't understand this though. What does encoding have to do
>>> with the string? Isn't encoding a separate process?
>>
>> Hm, my ability to express myself obviously totally su*ks :)
>> you are completely right, that the encoding is a completely
>> separate process, and I'm saying that I want it *completely*
>> to be hidden from my sight, unless it is absolutely necessary
>> for me to be concerned about it :-)
>>
>
> So what would be the point of implementing a string "wrapper" that
> knew its encoding as part of the type if you didn't want to know the
> encoding in most of the cases? I think I'm missing the logic there.

The logic would be that you would no longer have to
be concerned whether A' (A with acute), etc., is encoded as ISO-8859-2 says
or as UTF-8 says. You would *always* handle
the string as a sequence of *Unicode* code points or even
"logical characters", and not as a sequence of bytes that are
somehow encoded.
I can imagine use cases where it would still be OK to
get the underlying byte sequence (read-only) for things
that are encoding-independent.
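
To illustrate the difference between the byte view and the code-point view, here is a minimal sketch. The function name count_code_points is made up for this example and is not part of any proposed interface; the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from any proposed Boost interface).
// Counts Unicode code points in a string assumed to hold valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx and are not counted,
// so every lead byte (or ASCII byte) contributes exactly one code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For A with acute (U+00C1) followed by "bc", the UTF-8 form "\xC3\x81" "bc" occupies 4 bytes but holds 3 code points, whereas the ISO-8859-2 form "\xC1" "bc" occupies 3 bytes. A code-point interface would hide that difference.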

>
>> The means for this would be: Let us build a string, that may
>> (or may not) be based on your general (encoding agnostic)
>> string. And this string would handle the transcoding in most
>> cases without me viewing the underlying byte sequence
>> by functors that need me *every time* to specify what encoding
>> I want explicitly. By default I want UTF-8, if I talk to the OS I
>> say I want the string in an encoding that the OS expects, not
>> that I want it in UTF-16, ISO-8859-2, KOI8-R, etc.
>> If and only if I want to handle the string in another encoding
>> than Unicode should I have to specify that explicitly.
>>
>
> So we're obviously talking about two different strings here -- your
> "text" that knows the encoding and the immutable string that you may
> or may not build upon. How then do you design the algorithms if you
> *didn't* want to explicitly specify the encoding you want the
> algorithms to use?

By saying that the *implicit* encoding is UTF-8 and that, should
I need to use another encoding, I will treat it as a special case.
Whenever I do not specify an encoding, UTF-8 is assumed
by default; i.e. when I'm reading text from
a TCP connection or from a file, I expect that it is already
UTF-8 encoded and would like the string (optionally or always)
to validate it for me.
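
As a rough idea of what "validate it for me" could mean for incoming bytes, here is a sketch of a UTF-8 well-formedness check. The name is_valid_utf8 is an assumption for illustration only; a real string class might validate on construction instead.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (illustration only).
// Returns true if `s` is well-formed UTF-8. Rejects truncated sequences,
// stray continuation bytes, overlong forms, surrogates (U+D800..U+DFFF)
// and values above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;            // stray continuation or invalid lead byte
        if (i + len > n) return false; // sequence is cut off
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // outside Unicode
        i += len;
    }
    return true;
}
```

With this check, the ISO-8859-2 byte "\xC1" on its own is rejected, while the UTF-8 sequence "\xC3\x81" is accepted, so text in the wrong encoding tends to be caught early rather than silently misread.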

Then there are two cases:
a) The default encoding of std::string, which depends on std::locale,
and the encoding of std::wstring, which is, for example, treated
by default as UTF-16 on Windows and as UTF-32 on Linux.
For these I would love to have some simple means of saying
to 'boost::text': give me your representation in the encoding
that std::string is expected to be encoded in, or "build" yourself
from the native encoding that std::string is supposed to be using.
The same goes for wstring.

b) Every other encoding. For example if I really needed
to convert my string to IBM CP850 because I want
to send it to an old printer then only in this case should
I be required (obviously) to specify the encoding explicitly.
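
For case a), a sketch of the wstring half might look like the following. The function to_native_wide is a made-up name, the input is assumed to be valid UTF-8, and the choice between UTF-16 and UTF-32 is inferred from sizeof(wchar_t), which matches the Windows and Linux conventions mentioned above but is only a heuristic.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (illustration only, not a proposed interface).
// Converts UTF-8 (assumed valid) to the conventional std::wstring encoding
// of the platform: UTF-16 when wchar_t is 2 bytes wide, UTF-32 otherwise.
std::wstring to_native_wide(const std::string& utf8) {
    std::wstring out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        unsigned long cp;
        std::size_t len;
        if (c < 0x80)      { cp = c;        len = 1; }
        else if (c < 0xE0) { cp = c & 0x1F; len = 2; }
        else if (c < 0xF0) { cp = c & 0x0F; len = 3; }
        else               { cp = c & 0x07; len = 4; }
        // Fold in the payload bits of the continuation bytes.
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        i += len;
        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // UTF-16: code points beyond the BMP become a surrogate pair.
            cp -= 0x10000;
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<wchar_t>(cp);
        }
    }
    return out;
}
```

The caller never names UTF-16 or UTF-32; it only asks for "whatever wstring wants here". Case b) would still need an explicit encoding argument, since nothing about the platform implies CP850.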

>
> In one of the previous messages I laid out an algorithm template like so:
>
>  template <class String>
>  void foo(String s) {
>    view<encoding> encoded(s);
>    // deal with encoded from here on out
>  }
>
> Of course then from foo's user perspective, she wouldn't have to do
> anything with her string to be passed in. From the algorithm
> implementer perspective you would know exactly what encoding was
> wanted and how to go about implementing the algorithm even potentially
> having something like this as well:
>
>  template <class Encoding>
>  void foo(view<Encoding> encoded) {
>    // deal with the encoded string appropriately here
>  }
>
> And you get the benefits in either case of being able to either
> explicitly or implicitly deal with strings depending on whether they
> have been explicitly encoded already or whether it's just a raw set of
> bytes.

I see that this is OK for many use cases. But having a single
pre-defined default encoding also has its advantages, because
usually you can skip the whole view<Encoding> part.
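
A sketch of how a default encoding could let callers skip the view<Encoding> part while keeping the explicit form available. All names here (utf8, iso_8859_2, view, encoding_name, describe) are hypothetical stand-ins and are not taken from any existing library.

```cpp
#include <string>

// Hypothetical encoding tags.
struct utf8 {};
struct iso_8859_2 {};

// Hypothetical view that interprets raw bytes as text in `Encoding`.
// UTF-8 is the default, so `view<>` means "UTF-8 view".
template <class Encoding = utf8>
struct view {
    explicit view(const std::string& raw) : bytes(raw) {}
    const std::string& bytes;  // non-owning; `raw` must outlive the view
};

inline const char* encoding_name(utf8) { return "UTF-8"; }
inline const char* encoding_name(iso_8859_2) { return "ISO-8859-2"; }

// Algorithm written against an explicit encoding.
template <class Encoding>
std::string describe(view<Encoding>) {
    return encoding_name(Encoding());
}

// Convenience overload: a plain string is implicitly treated as UTF-8.
inline std::string describe(const std::string& s) {
    return describe(view<>(s));
}
```

Here describe(s) picks UTF-8 without the caller mentioning it, while describe(view<iso_8859_2>(s)) covers the special case explicitly.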

>
[snip/]
>> This is a different matter, Again I may be wrong but I live
>> under the expression that RangeEx has been implemented
>> to hide the ugliness of complex STL iterator-based algorithms.
impression (of course) :)
>
> Of course the proof will be in the pudding. ;)
>
>>> I think we need to qualify what you refer to as APIs. If just judging
>>> from the amount of code that's written against Qt or MFC for example
>>> then I'd say "they're pretty well accepted". If you look at the
>>> libraries that use ICU as a backend I'd say we already have one in
>>> Boost called Boost.Regex. And there's all these other libraries in the
>>> Linux arena that have their own little niche to play in the Unicode
>>> game -- there's Glib, the GNOME and KDE libraries, ad nauseam.
>>
>> Besides what you mentioned an API for me is for example
>> WINAPI, POSIX API, OpenGL API, OpenSSL API, etc.
>> Basically all the functions "exported" by the various C/C++
>> libraries that I cannot imagine my life without :) and which
>> expect not a generic iterator range or a view or whatnot
>> but plain and simple pointer (const char*) pointing to a contiguous
>> block in memory containing a zero terminated C string,
>> or if we are luckier expects std::string.
>>
>
> So, if there was a way to "encode" (there's that word again) the data
> in an immutable string into an acceptably-rendered `char const *`
> would that solve the problem? The whole point of my assertion (and
> Dave's question) is whether c_str() would have to be intrinsic to the
> string, which I have pointed out in a different message (not too long
> ago) that it could very well be an external algorithm.

Generally speaking, the syntax is not that important to me;
I can get used to almost anything :) so c_str(my_str) is
OK with me, as long as it does not just copy the string
whatever the internal representation is. As Robert said,
if the internal string data is already contiguous then
this should be a no-op.

boost::string s = get_huge_string();
s = s ^ get_another_huge_string();
s = s ^ get_yet_another_huge_string();
std::string(s).c_str()

is too inefficient for my taste.
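
A sketch of an external c_str() that copies only when it has to. The class chunked_string and its members are invented for this illustration and say nothing about how the proposed string would really store its data; the flattening is done as a cached, logically-const operation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical segmented string: appending adds chunks instead of
// rebuilding one big buffer. Invented for illustration only.
class chunked_string {
public:
    chunked_string(const std::string& s) : chunks_(1, s) {}

    // Appends the chunks of another string (self-append is not supported).
    chunked_string& append(const chunked_string& other) {
        chunks_.insert(chunks_.end(), other.chunks_.begin(), other.chunks_.end());
        return *this;
    }

    bool contiguous() const { return chunks_.size() == 1; }

    // Merges all chunks into one buffer; does nothing if already contiguous.
    void flatten() const {
        if (contiguous()) return;
        std::string merged;
        for (std::size_t i = 0; i < chunks_.size(); ++i) merged += chunks_[i];
        chunks_.assign(1, merged);
    }

    const std::string& storage() const { return chunks_.front(); }

private:
    mutable std::vector<std::string> chunks_;  // never empty
};

// External algorithm: the returned pointer refers to the string's own storage
// and stays valid until the string is modified or destroyed.
inline const char* c_str(const chunked_string& s) {
    s.flatten();
    return s.storage().c_str();
}
```

Calling c_str() twice on an unchanged string returns the same pointer, so the cost of flattening is paid at most once per modification rather than on every call.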

>>
>
> Right. This is Boost anyway, and I've always viewed libraries that get
> proposed to and accepted into Boost as the kinds of libraries that are
> developed to eventually be made part of the C++ standard library.
>
> So while out of the gate the string implementation can very well be
> not called std::string, I don't see why the current std::string can't
> be deprecated later on (look at std::auto_ptr) and a different
> implementation be put in its place? :D Of course that may very well be
> C++21xx so I don't think I need to worry about it having to be a
> std::string killer in the outset. ;)

If you pull this off (replacing std::string without having a transition period
with a backward-compatible interface) then you will be my personal hero. :-)
Wait... provided that the encoding-related stuff I described above
will be part of the string :) or that there will be some wrapper around it
providing that functionality.

Matus

