Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
From: Matus Chochlik (chochlik_at_[hidden])
Date: 2011-08-15 12:51:01


On Mon, Aug 15, 2011 at 1:08 PM, Yakov Galka <ybungalobill_at_[hidden]> wrote:
> On Fri, Aug 12, 2011 at 15:04, Matus Chochlik <chochlik_at_[hidden]> wrote:
>
[...]
>> >>
>> >>  // by default expect UTF8
>> >>  text(const std::string& str)
>> >>  {
>> >>     assert(is_utf8(str.begin(), str.end()));
>> >>     store(str);
>> >>  }
>> >>

OK, to clarify. I certainly do not insist on implicit
conversion from const char* and std::string.
In fact I would like this and the other constructor
to be explicit.

>> >
>> > What you are doing is, in fact, forcing the assumed encoding of
>> std::string
>> > to UTF-8. You just said you think it's a bad idea.
>>
>> No, I'm proposing to implement a *new* class that
>> will store the text in UTF8 encoding and if during
>> the construction no encoding is specified, then it
>> is assumed that the particular std::string is already
>> in UTF8.
>>
>
>> This is *very* different from imposing
>> an encoding on std::string which is already
>> used in many situations with other encodings.
>> i.e. my approach does not break any existing code.
>>
>
> Sorry, your arguments start to look non-constructive to me. Correct me where
> I'm wrong in the following reasoning.
>
> (1) You object to UTF-8 strings in boost interface because someone may pass
> something other than UTF-8 there and it's going to be undetected at compile
> time:
>
> namespace boost { void func(const std::string& a); } // UTF-8
> boost::func(non_utf_string); //oops

Yes, this is my concern, but see below.

>
> You're proposing a `text` class that is meant to somehow overcome this
> problem. So you change the boost interface to accept `text` but user code is
> left unchanged...:

No, I never said that the client code should be left
unchanged and I'm sorry if you came to this conclusion
because I did not express myself clearly.

The text class should come with documentation
that clearly states that it uses Unicode and UTF8,
and that if you are constructing text from a string
then either you must be sure that the string already
is in UTF8 (which is not always the case, so I'm not
forcing std::string to be UTF8), OR you must
specify (by means of the symbolic tag)
where the string came from: the OS, or some external
library that uses neither Unicode nor the OS's
conventions.

I prefer the symbolic tags because the logic that decides
whether a conversion is needed at all, and between
which source and destination encodings, stays hidden
from the user inside the library.

>
> namespace boost { void func(const text& a); }
> boost::func(non_utf_string); //oops, the std::string default constructor is
> called.
>
> Yes, you can make this constructor explicit, so the above code stops
> compiling and the user must write explicitly:
> boost::func(text(non_utf_string));

Yes, this is the idea:
boost::func(text(non_utf_string, textenc::symbolic_tag()));

The advantage is that if the authors of the library which
produced the non_utf_string change their mind and
in a new version start to encode their strings in UTF8,
this code does not have to be touched. You update
the *text* library which will take the change into account
and recompile your application (using the code above).
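
A minimal sketch of that idea, under stated assumptions: the
tag name textenc::legacy_lib, the to_utf8() overloads and the
str() accessor are hypothetical and not part of any existing
library, and the Latin-1 source encoding is chosen only for
illustration. The is_utf8() check is structural only (it does
not reject overlong forms).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Structural UTF-8 check: lead bytes followed by the right number of
// continuation bytes. Not a full validator (overlong forms pass).
inline bool is_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        if (c < 0x80) extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return false;
        if (i + extra >= s.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
        i += extra + 1;
    }
    return true;
}

namespace textenc {
// Hypothetical tag: "this string came from some legacy library".
struct legacy_lib {};

// The knowledge of what that library emits lives here, inside the text
// library. Assumed to be Latin-1 today; if the library switches to UTF-8,
// only this function changes and application code is merely recompiled.
inline std::string to_utf8(legacy_lib, const std::string& s) {
    std::string out;
    for (char ch : s) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
}  // namespace textenc

class text {
public:
    // No tag: the caller promises UTF-8. Explicit, so no silent conversion.
    explicit text(const std::string& s) : bytes_(s) { assert(is_utf8(s)); }

    // With a tag: the library decides whether and how to transcode.
    template <class Tag>
    text(const std::string& s, Tag tag) : bytes_(textenc::to_utf8(tag, s)) {}

    const std::string& str() const { return bytes_; }  // always UTF-8

private:
    std::string bytes_;
};
```

With this shape a call such as
text(non_utf_string, textenc::legacy_lib()) stays the same in
application code whatever encoding the tagged library uses.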

>
> But then there is nothing in your proposal that makes std::string utf-8
> encoded by 'default'. Default == implicit.

The idea is that the documentation will say so. See above.

>
> [...]
>>
>> I believe that it is more generic to use a combination
>> of function + tag than just a function because there
>> are other APIs besides the OS's that use various
>> encodings and my approach scales better.
>>
>> Or do you like from_narrow_os(), from_wide_os(),
>> from_narrow_stdlib(), from_narrow_lib1(), ...
>> from_wide_libN(); more ?
>>
>
> (2) No, I'd never proposed that. I repeat this again: The only encodings
> which matter are 'system default', UTF-8, and UTF-16. I would like to see a
> list of widely used libraries which use other encodings, please. [ Note: Not
> including libraries used for encoding conversions. — end note. ]

I'm not saying that every library uses a different encoding,
but the situation is not as ideal as you put it either (that
everybody uses the OS's conventions). But sorry, I'm not a
software API encyclopedia, so no list ;).

> Even if
> there is such a library out there, the user is *already* converting to/from
> its exotic encoding.

Yes and this is precisely the most annoying part of working with
text in C++. I do not say that the proposed library is completely wrong,
but IMO the *ultimate Unicode string* must handle two things:

A) the *Unicode stuff* which the proposed library basically does
and there are others which do as well: Boost.Locale, Boost.Unicode

B) (equally important) handle *conveniently*
the conversions from/to external APIs.
What good is a Unicode library if you have to do, for example, the
*WINAPI-text-string-conversion-voodoo* before or after every
call to WINAPI, of which there are hundreds (at least in the apps
that I work on)? And these have to work with code using Qt, wxWidgets,
MySQL, libpq, ODBC, OpenGL, OpenSSL, XML parsers, etc.
Many of these (I don't say all) have their own conventions about
text encoding, and these conventions change over time and are not
always consistent with the OS.

My proposal only allows you to hide annoying details in one
(easily extensible) library, which in turn results in cleaner,
less cluttered and more stable application code.

>
>
>> [...]
>> Besides, it does not harm you in any way
>
>
> It does. I already use UTF-8 for all my strings, even on windows, and I
> don't want the code-bloat of all these conversions (even if they're no-ops).

Again (if you know that it is UTF8 you don't have to say it out loud):

>> The approach that I proposed *does not* force you to specify
>> the utf8 encoding explicitly neither if you are 100% sure that
>> the string is in UTF8 and that this does not change
>> under any circumstances (like when somebody changes the locale)
>>
>
> Huh? Neither mine. Again, what you say here contradicts (1).

I never said that yours does; I'm just saying that mine
doesn't either.

>
>
[...]
>> You need to specify the source from which
>> the text comes (by the symbolic tag) and the library
>> handles the details for you. If the source is UTF8
>> do nothing otherwise do the transcoding.
>>
>
> So can I summarize this debate as 'the programmer specifies the library and
> boost chooses the encoding' versus 'the programmer goes to the documentation
> of the library and says boost what encoding to use'? If yes, then it's a
> quite minor design decision. According to (2) I claim that there will be
> less encodings than libraries.

Bingo. To summarize my points:
Besides handling Unicode well, the library must also play
nicely with other libraries. To do that, and to take the
burden off the application programmer and remove code
repetition, it should hide the conversion logic and let the
user just say where the text is coming from or going to.

And again, I don't remember saying that there will be more
encodings than libraries. The libpq was just an example
(I have used it for more than 10 years and it wasn't always
using UTF8, like it does now).

> And by doing to_narrow/from_narrow
>> you are trying to do that "transparently".
>> But again, that are other sources
>> of text which use other encodings, besides
>> the OS API.
>>
[...]
>
> Neither I suggested passing non-utf-8 string to a utf-8 assumed string. It's
> not about the way you proposed to write the code, it's about your proposal
> doesn't solve the problem it was advocated to solve and be better from
> Artyom's and my proposal. See (1). You say that there is some code that you
> don't want to break, code you want to be compatible with. Which code? This
> code:

No, maybe I didn't say this clearly but I do not want
implicit conversion between text and string.

>
> char str[1024];
> GetWindowTextA(hwnd, str, sizeof(str));
> boost::function_with_text_parameter(str); // currently assumes system
> encoding
>
> Let's leave aside the fact that this code uses deprecated winapi interface
> and thus unicode-unaware. Yes, it *should* be written as:
>
> char cstr[1024];
> GetWindowTextA(hwnd, cstr, sizeof(cstr));
> text str(cstr, textenc::winapi());
> boost::function_with_text_parameter(str);

>
> BUT (!!!), until the user rewrites this code, you've silently broke his
> code. This is *exactly* the same situation as assuming std::string is utf-8
> in the first place, and your way of how the user had to write his code is
> almost the same as mine:
>
> char str[1024];
> GetWindowTextA(hwnd, str, sizeof(str));
> boost::function_with_text_parameter(from_narrow(str)); // accepts UTF-8
> std::string
> // versus:
> boost::function_with_text_parameter(text(str, textenc::winapi)); // accepts
> your::text

This is exactly what it should look like!

And I certainly do not have anything against a syntactic
sugar function like for example:

boost::function_with_text_parameter(text::from_os(str));

which would hide the ugly(?) tags and possible no-ops
in the most important cases of conversions, like the one
when you are talking to the OS API.
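
A rough sketch of how such a sugar function could sit on top of
the tag-based constructor. Everything here is an assumption
made for illustration: the textenc::os tag, the to_utf8()
overload and the str() accessor are hypothetical, and the
OS transcoder is a placeholder no-op where a real one would
consult the active code page or locale.

```cpp
#include <cstddef>
#include <string>

namespace textenc {
// Hypothetical tag: "this string came from the OS API".
struct os {};

// Placeholder: pretends the OS narrow encoding is already UTF-8.
// A real implementation would transcode from the system encoding.
inline std::string to_utf8(os, const std::string& s) { return s; }
}  // namespace textenc

class text {
public:
    // General form: the tag names the source of the string.
    template <class Tag>
    text(const std::string& s, Tag tag) : bytes_(textenc::to_utf8(tag, s)) {}

    // Sugar for the most common case; hides the tag at the call site.
    static text from_os(const std::string& s) { return text(s, textenc::os()); }

    const std::string& str() const { return bytes_; }  // always UTF-8

private:
    std::string bytes_;
};

// Stand-in for a Boost function that accepts text; returns the byte length.
inline std::size_t function_with_text_parameter(const text& t) {
    return t.str().size();
}
```

Both spellings then reach the same constructor, so
function_with_text_parameter(text::from_os(str)) and
function_with_text_parameter(text(str, textenc::os())) are
interchangeable.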

[...]

Matus

