Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
From: Matus Chochlik (chochlik_at_[hidden])
Date: 2011-08-12 08:04:59


On Fri, Aug 12, 2011 at 1:08 PM, Yakov Galka <ybungalobill_at_[hidden]> wrote:
> On Fri, Aug 12, 2011 at 12:00, Matus Chochlik <chochlik_at_[hidden]> wrote:
>
>> On Fri, Aug 12, 2011 at 9:57 AM, Daniel James <dnljms_at_[hidden]> wrote:
>> > On 11 August 2011 12:57, Artyom Beilis <artyomtnk_at_[hidden]> wrote:
>> >>
[...]
>>
>>  // by default expect UTF8
>>  text(const std::string& str)
>>  {
>>     assert(is_utf8(str.begin(), str.end()));
>>     store(str);
>>  }
>>
>
> What you are doing is, in fact, forcing the assumed encoding of std::string
> to UTF-8. You just said you think it's a bad idea.

No, I'm proposing to implement a *new* class that
will store the text in UTF8 encoding and if during
the construction no encoding is specified, then it
is assumed that the particular std::string is already
in UTF8.

This is *very* different from imposing
an encoding on std::string, which is already
used in many situations with other encodings.
I.e., my approach does not break any existing code.
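
To make the distinction concrete, here is a minimal sketch of such a class. It is only an illustration under assumptions: this is_utf8 takes a whole std::string (the snippet quoted above passes iterators), the accessor name utf8() is made up, and nothing here is the actual proposed interface. Existing std::string code is untouched; only code that opts in to text gets the UTF8 expectation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original proposal): true if s is
// well-formed UTF-8. Rejects truncated sequences, overlong forms,
// surrogates and code points above U+10FFFF.
inline bool is_utf8(const std::string& s)
{
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (c < 0x80)              { len = 1; cp = c; }
        else if ((c >> 5) == 0x06) { len = 2; cp = c & 0x1F; }
        else if ((c >> 4) == 0x0E) { len = 3; cp = c & 0x0F; }
        else if ((c >> 3) == 0x1E) { len = 4; cp = c & 0x07; }
        else return false;                       // stray continuation or invalid lead byte
        if (i + len > n) return false;           // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (len == 2 && cp < 0x80) return false;     // overlong
        if (len == 3 && cp < 0x800) return false;    // overlong
        if (len == 4 && cp < 0x10000) return false;  // overlong
        if (cp > 0x10FFFF) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += len;
    }
    return true;
}

// A *new* type: std::string itself keeps whatever encoding its users chose.
class text {
public:
    // No encoding specified: the input is assumed to be UTF-8 already.
    text(const std::string& str) : utf8_(str) { assert(is_utf8(str)); }

    // Hypothetical accessor for the stored UTF-8 bytes.
    const std::string& utf8() const { return utf8_; }

private:
    std::string utf8_;
};
```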

>
>
>> [...]
>> text t1 = "blahblah"; // must be utf8
>>
>> // whatever encoding the compiler uses for wide literals
>> text t2(L"blablablabl", textenc::compiler());
>>
>> text t3(some_posix_function(), textenc::posix());
>>
>> text t4(SomeWinapiFunc(), textenc::winapi());
>> text t5(SomeWinapiFuncW(), textenc::winapi());
>>
>
> How is it better than:
> string t4 = from_narrow(SomeWinapiFuncA()); // use the default encoding used
> by system for narrow strings
> string t5 = from_wide(SomeWinapiFuncW()); // wchar_t on windows is always
> utf16

I believe that it is more generic to use a combination
of function + tag than just a function because there
are other APIs besides the OS's that use various
encodings and my approach scales better.

Or do you like from_narrow_os(), from_wide_os(),
from_narrow_stdlib(), from_narrow_lib1(), ...
from_wide_libN() more?
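
A rough sketch of what I mean by function + tag, assuming two made-up tags (textenc::utf8 and textenc::latin1, the latter standing in for "some library that returns ISO-8859-1") and a made-up to_utf8 overload set. None of these names are a finished interface. Supporting another source means adding one tag and one overload; the text constructor and the calling code keep the same shape.

```cpp
#include <cassert>
#include <string>

namespace textenc {
struct utf8 {};    // the source already produces UTF-8
struct latin1 {};  // hypothetical source that produces ISO-8859-1
}

// UTF-8 source: nothing to do.
inline std::string to_utf8(const std::string& s, textenc::utf8)
{
    return s;
}

// ISO-8859-1 source: each byte is the code point U+0000..U+00FF,
// so bytes >= 0x80 become a two-byte UTF-8 sequence.
inline std::string to_utf8(const std::string& s, textenc::latin1)
{
    std::string out;
    out.reserve(s.size());
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

class text {
public:
    // No tag: assumed to be UTF-8 already.
    text(const std::string& s) : utf8_(s) {}

    // Tag given: the matching to_utf8 overload is picked at compile time.
    template <class Encoding>
    text(const std::string& s, Encoding enc) : utf8_(to_utf8(s, enc)) {}

    const std::string& utf8() const { return utf8_; }

private:
    std::string utf8_;
};
```

With this, text t("caf\xE9", textenc::latin1()) and text t("caf\xC3\xA9", textenc::utf8()) end up holding the same bytes, and the second one costs no conversion.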

>
>
>> text t6(pq_some_func(), textenc::libpq());
>>

>
> You don't need it. You're proposing a design that tries to solve a
> non-existing problem. There is no such diversity of encodings in the
> interfaces. I don't know what is libpq, but it either uses UTF-8 in which
> case you write:
>
> string t6 = pq_some_func();
>
> or the default system encoding, in which case you write:

This is just an example. OK, libpq already uses UTF8,
but there are others that do not.

Besides, it does not harm you in any way to do this,
because if the returned string is already in UTF8
you would not do any conversion, but if you are
using a very old version of libpq (not using UTF8),
the transcoding would be handled automatically.

The same with other libraries/APIs.

>
> string t6 = from_narrow(pq_some_func());
>
> As you start using more libraries with UTF-8 default encoding, you will use
> from_* less frequently.
> (It's possible to use a single to_utf8 instead of from_narrow/from_wide
> combination.)

The approach that I proposed *does not* force you to specify
the utf8 encoding explicitly either, if you are 100% sure that
the string is in UTF8 and that this does not change
under any circumstances (like when somebody changes the locale).

>
> [...]
>> SomeWinapiFunction(t8.str(textenc::winapi()).c_str());
>> SomeWinapiFunctionW(concat(t9, text::newline(),
>> t8).wstr(textenc::winapi()).c_str());
>>
>
> Same as above. 'text' as a distinct type doesn't play any role here. If t9
> is std::string, this becomes:
>
> SomeWinapiFunctionA(to_narrow(t8).c_str()); // to the default narrow
> system-encoding.

And what if t8 was read from another source (not UTF8 and not WINAPI)
which may for example use the locale's encoding or some arbitrary
encoding?
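
To illustrate, here is a sketch under the same kind of assumptions as before: the tags, to_utf8/from_utf8 and call_latin1_api are invented for the example, and ISO-8859-1 stands in for whatever the API expects. The source encoding is dealt with once, at construction. The call site then only has to name what the API expects, regardless of where the text came from.

```cpp
#include <cassert>
#include <string>

namespace textenc {
struct utf8 {};
struct latin1 {};  // hypothetical stand-in for any non-UTF-8 source or target
}

// --- decoding to UTF-8, used at construction ---
inline std::string to_utf8(const std::string& s, textenc::utf8) { return s; }

inline std::string to_utf8(const std::string& s, textenc::latin1)
{
    std::string out;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// --- encoding from UTF-8, used just before the API call ---
inline std::string from_utf8(const std::string& s, textenc::utf8) { return s; }

// Assumes s is valid UTF-8 (the invariant of text). Code points above
// U+00FF cannot be represented and are replaced with '?'.
inline std::string from_utf8(const std::string& s, textenc::latin1)
{
    std::string out;
    std::string::size_type i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) {
            out += static_cast<char>(c);
            i += 1;
        } else if ((c & 0xE0) == 0xC0 && i + 1 < s.size()) {
            const unsigned cp = ((c & 0x1Fu) << 6)
                              | (static_cast<unsigned char>(s[i + 1]) & 0x3Fu);
            out += cp <= 0xFF ? static_cast<char>(cp) : '?';
            i += 2;
        } else {
            // 3- or 4-byte sequence: skip the lead byte and its continuations.
            out += '?';
            ++i;
            while (i < s.size()
                   && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) {
                ++i;
            }
        }
    }
    return out;
}

class text {
public:
    template <class Source>
    text(const std::string& s, Source src) : utf8_(to_utf8(s, src)) {}

    template <class Target>
    std::string str(Target dst) const { return from_utf8(utf8_, dst); }

private:
    std::string utf8_;
};

// Hypothetical wrapper around an API that expects ISO-8859-1. It returns
// the bytes it would pass on, so the effect can be inspected.
inline std::string call_latin1_api(const text& t)
{
    return t.str(textenc::latin1());
}
```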

> SomeWinapiFunctionW(to_wide(t9 + "\r\n" + t8).c_str()); // what kind of
> newline is expected defined by the API, not the system.

>
>
>> [...]
>> i.e. besides the fact that the string "uses utf8" (there is already
>> a whole heap of such strings) it must also handle all the conversions
>> between utf8 and whatever the OS and the major libraries and
>> APIs expect and use; conveniently (and effectively).
>> Otherwise the effort is IMHO wasted.
>>
>
> Your 'text' doesn't do this in a transparent way. In fact you cannot do it
> in transparent way because 'const char*' doesn't carry the necessary
> semantic information. The burden of deciding what encoding to convert
> to/from falls on the programmer *anyway*. You don't benefit anything from
> defining yet-another string type.

I didn't say *transparent*, I said *convenient*.
Of course you cannot do this completely
transparently because of the reasons you
mentioned (const char* can be encoded in any way).

You need to specify the source from which
the text comes (by the symbolic tag) and the library
handles the details for you. If the source is UTF8,
it does nothing; otherwise it does the transcoding.

And by doing to_narrow/from_narrow
you are trying to do that "transparently".
But again, there are other sources
of text which use other encodings, besides
the OS API.

>
> Boost libraries (at the very least those wrapping OS functionality)
>> should adopt this text class, and do the conversions, "just-in-time"
>> when making the OS API call.
>>
>
> In the light of the said above, your 'text' class won't catch bugs like:
>
> char str[1024];
> GetWindowTextA(hwnd, str, sizeof(str));
> boost::function_with_text_parameter(str);

No, I didn't suggest doing it this way,
so sorry, but this is a strawman.
This should look like:

char cstr[1024];
GetWindowTextA(hwnd, cstr, sizeof(cstr));
text str(cstr, textenc::winapi());
boost::function_with_text_parameter(str);

Best,

Matus