Subject: Re: [boost] [strings][unicode] Proposals for Improved String Interoperability in a Unicode World
From: Yakov Galka (ybungalobill_at_[hidden])
Date: 2012-01-29 01:08:33


On Sun, Jan 29, 2012 at 02:49, Mathias Gaunard <mathias.gaunard_at_[hidden]
> wrote:

> On 01/28/2012 08:48 PM, Yakov Galka wrote:
>
>> The user can just write
>>
>> cout<< u8"您好世界";
>>
>> Even better is:
>>
>> cout<< "您好世界";
>>
>> which *just works* on most compilers (e.g. GCC:
>> http://ideone.com/lBpMJ)
>> and needs some trickery on others (MSVC: save as UTF-8 without BOM).
>>
>
> No, that's just wrong.
> That's not the model that C++ uses. By not storing it with the BOM, you're
> essentially tricking MSVC into believing it is ANSI (windows-1252 on
> western systems), and thus avoiding the conversion from the source character
> set to the execution character set, since those happen to be the same.
>
> The way a C++ compiler is supposed to work is that all of your source is
> in the source character set, regardless of the type of string literal you
> use.
> Then the compiler will convert your source character set to the execution
> character set for narrow string literals, to the wide execution character
> set for wide string literals, to UTF-8 for u8 literals, etc.
>

Sorry for not being clear enough. I agree and I've not said otherwise. The
second 'cout' line *is* a hack. I admit it won't work if you mix such
string literals with wide literals or external identifiers containing
Unicode. The intent was to show how it could be done if the effort was
focused on making narrow string literals "Unicode compatible".
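
To make the conversion model quoted above concrete, here is a minimal
sketch. It assumes a C++11 compiler, and the names utf8_bytes, wide_units
and narrow_bytes are made up for illustration. The literals use universal
character names for "您好世界", so the result does not depend on how the
source file is saved.

```cpp
#include <cstddef>

// u8 literal: always encoded as UTF-8. Each of the four characters takes
// three bytes in UTF-8; subtract 1 for the terminating null.
constexpr std::size_t utf8_bytes =
    sizeof(u8"\u60A8\u597D\u4E16\u754C") - 1;

// Wide literal: encoded in the wide execution character set. Assuming it is
// UTF-16 or UTF-32 (as on MSVC and GCC), each of these BMP characters is a
// single wchar_t.
constexpr std::size_t wide_units =
    sizeof(L"\u60A8\u597D\u4E16\u754C") / sizeof(wchar_t) - 1;

// Narrow literal: encoded in the execution character set, so its size is
// implementation-defined and deliberately not checked here.
constexpr std::size_t narrow_bytes =
    sizeof("\u60A8\u597D\u4E16\u754C") - 1;

static_assert(utf8_bytes == 12, "u8 literals are UTF-8");
```

Only the u8 size is fixed by the language; the narrow size is the part that
the "save without BOM" hack silently relies on.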

> [...] What probably should be done is that compilers should be compelled to
> support UTF-8 as the source character set in a unified way.
>

Yes, that would be nice. It would solve half the problem, which is a huge
step forward given the current mood of the committee. However, embedding
Unicode string literals in source code is still not something you routinely
do. Internationalization usually uses external string tables.
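
As a rough illustration of the string-table approach, here is a minimal
sketch. The names string_table, translate and usage_example are
hypothetical, and the table is filled in memory only to keep the example
self-contained; a real table would be loaded from an external file. The
point is that the source code contains only ASCII keys, and the UTF-8 text
is plain data.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical string table: ASCII key -> UTF-8 encoded message.
typedef std::map<std::string, std::string> string_table;

// Returns the message stored under `key`, or `key` itself if it is missing.
inline std::string translate(const string_table& table,
                             const std::string& key) {
    string_table::const_iterator it = table.find(key);
    return it != table.end() ? it->second : key;
}

inline void usage_example() {
    string_table zh;
    // UTF-8 bytes of "您好世界", spelled as escapes so that this file
    // stays pure ASCII.
    zh["greeting"] = "\xE6\x82\xA8\xE5\xA5\xBD\xE4\xB8\x96\xE7\x95\x8C";

    assert(translate(zh, "greeting").size() == 12);  // 4 chars * 3 bytes
    assert(translate(zh, "farewell") == "farewell"); // falls back to key
}
```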

> I once asked volodya if it were feasible to implement this in the build
> system (add a BOM for MSVC), but he didn't seem to think it was worth it.

I don't understand. MSVC already understands the BOM, and GCC has already
been fixed according to
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33415 (I haven't tested it).

On Sun, Jan 29, 2012 at 03:12, Mathias Gaunard <mathias.gaunard_at_[hidden]
> wrote:

> I think you should consider the points being made in N3334.
> While that proposal is in my opinion not good enough, it raises an
> important issue that is often present with std::string-based or similar
> designs.
>
> A function that takes a std::string, or a boost::filesystem::path for that
> matter, necessarily causes the caller to copy the data into a
> heap-allocated buffer, even if there is no need to.
>
> Use of the range concept would solve that issue, but then that requires
> making the function a template. A type-erased range would be possible, but
> that has significant performance overhead.
> A string_ref or path_ref is maybe the lesser evil.
>

+1
This topic has been raised here in the context of program-options:
http://boost.2283326.n4.nabble.com/program-options-Some-methods-take-const-char-others-take-std-string-td3733894.html
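
The copying issue can be illustrated with a minimal sketch of a
non-owning string reference. This string_ref is a hypothetical toy class
written for this example, not the interface proposed in N3334, and
count_char and usage_example are made-up names. It only stores a pointer
and a length, so a non-template function can accept both string literals
and std::string without allocating.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical non-owning view over a contiguous char sequence.
// The referenced data must outlive the string_ref.
class string_ref {
public:
    string_ref(const char* s) : data_(s), size_(std::strlen(s)) {}
    string_ref(const std::string& s) : data_(s.data()), size_(s.size()) {}

    const char* data() const { return data_; }
    std::size_t size() const { return size_; }

private:
    const char* data_;
    std::size_t size_;
};

// An ordinary (non-template) function. Passing a literal builds no
// temporary std::string, so there is no heap allocation and no copy.
inline std::size_t count_char(string_ref s, char c) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (s.data()[i] == c) ++n;
    return n;
}

inline void usage_example() {
    const char* literal = "a-b-c";
    string_ref view(literal);
    assert(view.data() == literal);         // same buffer, nothing copied
    assert(count_char(literal, '-') == 2);  // from const char*

    std::string owned("--x");
    assert(count_char(owned, '-') == 2);    // from std::string
}
```

The cost is the usual one for reference-like types: the caller has to keep
the underlying buffer alive for as long as the string_ref is used.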

-- 
Yakov
