Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2011-08-15 07:08:14


On Fri, Aug 12, 2011 at 15:04, Matus Chochlik <chochlik_at_[hidden]> wrote:

> On Fri, Aug 12, 2011 at 1:08 PM, Yakov Galka <ybungalobill_at_[hidden]>
> wrote:
> > On Fri, Aug 12, 2011 at 12:00, Matus Chochlik <chochlik_at_[hidden]>
> wrote:
> >
> >> On Fri, Aug 12, 2011 at 9:57 AM, Daniel James <dnljms_at_[hidden]> wrote:
> >> > On 11 August 2011 12:57, Artyom Beilis <artyomtnk_at_[hidden]> wrote:
> >> >>
> [...]
> >>
> >> // by default expect UTF8
> >> text(const std::string& str)
> >> {
> >> assert(is_utf8(str.begin(), str.end()));
> >> store(str);
> >> }
> >>
> >
> > What you are doing is, in fact, forcing the assumed encoding of
> std::string
> > to UTF-8. You just said you think it's a bad idea.
>
> No, I'm proposing to implement a *new* class that
> will store the text in UTF8 encoding and if during
> the construction no encoding is specified, then it
> is assumed that the particular std::string is already
> in UTF8.
>

> This is *very* different from imposing
> an encoding on std::string which is already
> used in many situations with other encodings.
> i.e. my approach does not break any existing code.
>

Sorry, but your arguments are starting to look non-constructive to me. Correct
me where I'm wrong in the following reasoning.

(1) You object to UTF-8 strings in boost interface because someone may pass
something other than UTF-8 there and it's going to be undetected at compile
time:

namespace boost { void func(const std::string& a); } // UTF-8
boost::func(non_utf_string); //oops

You're proposing a `text` class that is meant to somehow overcome this
problem. So you change the boost interface to accept `text` but user code is
left unchanged...:

namespace boost { void func(const text& a); }
// oops, the implicit converting constructor text(const std::string&) is called:
boost::func(non_utf_string);

Yes, you can make this constructor explicit, so the above code stops
compiling and the user must write explicitly:
boost::func(text(non_utf_string));

But then there is nothing in your proposal that makes std::string utf-8
encoded by 'default'. Default == implicit.
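
To make the implicit/explicit distinction concrete, here is a minimal sketch.
The names implicit_text, explicit_text, func_implicit and func_explicit are
hypothetical stand-ins I made up for illustration; they are not the interface
of the proposed class. The static_asserts only show what the compiler accepts
in each variant.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical stand-in: a text class with an implicit (converting) constructor.
struct implicit_text {
    implicit_text(const std::string& s) : bytes(s) {}
    std::string bytes;
};

// Hypothetical stand-in: the same class with the constructor made explicit.
struct explicit_text {
    explicit explicit_text(const std::string& s) : bytes(s) {}
    std::string bytes;
};

// With the implicit constructor, func(std_string) compiles silently.
static_assert(std::is_convertible<std::string, implicit_text>::value,
              "implicit constructor allows silent conversion");

// With the explicit constructor, func(std_string) is rejected at compile time...
static_assert(!std::is_convertible<std::string, explicit_text>::value,
              "explicit constructor forbids silent conversion");

// ...and the caller has to spell out func(explicit_text(std_string)).
static_assert(std::is_constructible<explicit_text, std::string>::value,
              "explicit construction is still possible");

// Both functions just report the stored byte count; neither can know
// whether the bytes it was handed are really UTF-8.
std::size_t func_implicit(const implicit_text& t) { return t.bytes.size(); }
std::size_t func_explicit(const explicit_text& t) { return t.bytes.size(); }
```

In both variants the bytes arrive unchecked unless the constructor validates
them at run time; the explicit version only forces the caller to write the
conversion out.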

[...]
>
> I believe that it is more generic to use a combination
> of function + tag than just a function because there
> are other APIs besides the OS's that use various
> encodings and my approach scales better.
>
> Or do you like from_narrow_os(), from_wide_os(),
> from_narrow_stdlib(), from_narrow_lib1(), ...
> from_wide_libN(); more ?
>

(2) No, I never proposed that. I repeat: the only encodings
which matter are 'system default', UTF-8, and UTF-16. I would like to see a
list of widely used libraries which use other encodings, please. [ Note: Not
including libraries used for encoding conversions. — end note. ] Even if
there is such a library out there, the user is *already* converting to/from
its exotic encoding.

> [...]
> Besides, it does not harm you in any way

It does. I already use UTF-8 for all my strings, even on Windows, and I
don't want the code bloat of all these conversions (even if they're no-ops).

> [...]
> >
> > string t6 = from_narrow(pq_some_func());
> >
> > As you start using more libraries with UTF-8 default encoding, you will
> use
> > from_* less frequently.
> > (It's possible to use a single to_utf8 instead of from_narrow/from_wide
> > combination.)
>
> The approach that I proposed *does not* force you to specify
> the utf8 encoding explicitly neither if you are 100% sure that
> the string is in UTF8 and that this does not change
> under any circumstances (like when somebody changes the locale)
>

Huh? Neither does mine. Again, what you say here contradicts (1).

> [...]
> And what if t8 was read from another source (not UTF8 and not WINAPI)
> which may for example use the locale's encoding or some arbitrary
> encoding?
>

See (2).

[...]
> You need to specify the source from which
> the text comes (by the symbolic tag) and the library
> handles the details for you. If the source is UTF8
> do nothing otherwise do the transcoding.
>

So can I summarize this debate as 'the programmer specifies the library and
Boost chooses the encoding' versus 'the programmer reads the library's
documentation and tells Boost which encoding to use'? If so, then it's a
fairly minor design decision. Given (2), I claim that there will be
fewer encodings than libraries.
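
A rough sketch of the two styles side by side, under loud assumptions: the
tag textenc::legacy_lib, the functions make_utf8 and from_latin1, and the
choice of Latin-1 as the "other" encoding are all hypothetical and picked
only because Latin-1 to UTF-8 is easy to write portably. Neither proposal
defines these names.

```cpp
#include <cassert>
#include <string>

// Latin-1 to UTF-8: bytes below 0x80 are copied, the rest become two bytes.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Style A: the programmer names the *source* and the library picks the encoding.
namespace textenc {
struct utf8 {};        // source already produces UTF-8
struct legacy_lib {};  // hypothetical library documented to return Latin-1
}

std::string make_utf8(const std::string& s, textenc::utf8) { return s; }
std::string make_utf8(const std::string& s, textenc::legacy_lib) {
    return latin1_to_utf8(s);
}

// Style B: the programmer reads the docs and names the *encoding* directly.
std::string from_latin1(const std::string& s) { return latin1_to_utf8(s); }
```

Both calls, make_utf8(s, textenc::legacy_lib()) and from_latin1(s), end up in
the same conversion routine. The difference is whether you need one tag per
library or one function per encoding.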

> And by doing to_narrow/from_narrow
> you are trying to do that "transparently".
> But again, there are other sources
> of text which use other encodings, besides
> the OS API.
>

See (2).

>
> >
> > Boost libraries (at the very least those wrapping OS functionality)
> >> should adopt this text class, and do the conversions, "just-in-time"
> >> when making the OS API call.
> >>
> >
> > In the light of the said above, your 'text' class won't catch bugs like:
> >
> > char str[1024];
> > GetWindowTextA(hwnd, str, sizeof(str));
> > boost::function_with_text_parameter(str);
>
> No I didn't suggest doing it this way
> so sorry but this is strawman.
> This should look like:
>
> char cstr[1024];
> GetWindowTextA(hwnd, cstr, sizeof(cstr));
> text str(cstr, textenc::winapi());
> boost::function_with_text_parameter(str);
>

Nor did I suggest passing a non-UTF-8 string where a UTF-8 string is assumed.
It's not about the way you proposed to write the code; it's that your proposal
doesn't solve the problem it was advocated to solve, so it is no better than
Artyom's proposal and mine. See (1). You say that there is some code that you
don't want to break, code you want to be compatible with. Which code? This
code:

char str[1024];
GetWindowTextA(hwnd, str, sizeof(str));
boost::function_with_text_parameter(str); // currently assumes system encoding

Let's leave aside the fact that this code uses a deprecated WinAPI interface
and is thus Unicode-unaware. Yes, it *should* be written as:

char cstr[1024];
GetWindowTextA(hwnd, cstr, sizeof(cstr));
text str(cstr, textenc::winapi());
boost::function_with_text_parameter(str);

BUT (!!!), until the user rewrites this code, you've silently broken it.
This is *exactly* the same situation as assuming std::string is UTF-8
in the first place, and the way you'd have the user write the code is
almost the same as mine:

char str[1024];
GetWindowTextA(hwnd, str, sizeof(str));
// mine: accepts a UTF-8 std::string
boost::function_with_text_parameter(from_narrow(str));
// versus yours: accepts your::text
boost::function_with_text_parameter(text(str, textenc::winapi()));

(3) The only way to avoid silent breakage is to trap it at compile time
by disabling the implicit conversion from std::string and char*. (!!!) By
making the constructor explicit you just break user code at compile time
rather than (silently) at run time. Indeed, that's a bit better than assuming
UTF-8 by default, but now your string is going to be hell to use, even for
those who already use UTF-8 encoded std::strings:

std::string str = get_utf_8_string();
// error: the explicit constructor of text is not called; please specify your intent.
boost::function_with_text_parameter(str);
// compiles, but wait: don't we want to encourage UTF-8 std::strings?
boost::function_with_text_parameter(text(str));

-- 
Yakov
