Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
From: Soares Chen Ruo Fei (crf_at_[hidden])
Date: 2011-08-12 13:13:27


On Thu, Aug 11, 2011, Artyom Beilis wrote:
> My strong opinion is:
>
> a. Strings should be just container object with default encoding
> and some useful API to handle it.
> b. Default encoding MUST be UTF-8
> c. There are several ways to implement strings COW, Mutable, Immutable,
> with small string optimization and so on. This way or other
> std::string is de-facto string and I think we should live with
> it and use some alternative containers where it matters.
> d. Code point and code unit are meaningless unless you develop
> some Unicode algorithm - and you don't - you use one written
> by experts.

> This Ustr does not solve this problem as it does not provide
> really some kind of
>
> adapter<generic encoding> {
> string content
> }
>
>
> This is some kind of thing that may be useful, but not in
> this case. Basically your library provides wrapper
> around string and outputs Unicode code points but it
> does it for UTF encodings only!
>
>
> It does not benefit too much. You provide encoding traits
> but it is basically meaningless for the purpose you had given
> as:
>
> It does not provide traits for non-Unicode encodings
> like lets say Shift-JIS or ISO-8859-8

The library is designed to be flexible without including every
possible encoding by default. The point is that external developers
can leverage the EncodingTraits template parameter to implement the
desired encoding *themselves*. The core library should stay as small
as possible and not be bloated with translation tables for encodings
that are not commonly used by the rest of the world. You may request
a sub-library for Shift-JIS or other encodings, and I'll consider
implementing it if there is popular demand.
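
To give a rough idea of the extension point (this is a self-contained
sketch, not the actual EncodingTraits interface documented for
Boost.Ustr, and every name in it is made up for illustration): a
user-supplied traits class bundles the decoding logic for its
encoding and is plugged in through the template parameter.

#include <string>

// Sketch only: a hypothetical single-byte "latin1_encoding_traits" showing
// how a non-Unicode encoding could be supplied from outside the library.
struct latin1_encoding_traits {
    // Latin-1 code units map directly onto the first 256 code points.
    static unsigned int decode(std::string::const_iterator& it) {
        return static_cast<unsigned char>(*it++);
    }
};

// A toy adapter parameterized on the traits type, mirroring the shape of
// unicode_string_adapter<StringT, EncodingTraits> without claiming its API.
template <typename StringT, typename EncodingTraits>
class toy_string_adapter {
public:
    explicit toy_string_adapter(const StringT& s) : str_(s) {}

    unsigned int first_codepoint() const {
        typename StringT::const_iterator it = str_.begin();
        return EncodingTraits::decode(it);
    }

private:
    StringT str_;
};

// e.g. toy_string_adapter<std::string, latin1_encoding_traits> a(latin1_text);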

> BTW you can't create traits for many encodings, for
> example you can't implement traits requirements:
>
> http://crf.scriptmatrix.net/ustr/ustr/advanced.html#ustr.advanced.custom_encoding_traits
>
>
> For popular encodings like Shift-JIS or GBK...
>
> Homework: tell me why ;-)

I was trying to write a few lines of prototype code to show you that
it'd work, but I've run out of time and missed too many replies, so
I'll show you next time. But why not? There is already a standard
translation table offered by the Unicode Consortium at
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT,
so all that is needed is to make the encoder/decoder work with that
translation table.
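
Something along these lines, as a sketch only; the two-entry table
stands in for the full mapping that would be generated from
SHIFTJIS.TXT, and the single-byte half-width katakana range
(0xA1-0xDF) is omitted for brevity:

#include <cstddef>
#include <string>

// Tiny excerpt of the Shift-JIS to Unicode mapping.
struct sjis_mapping { unsigned short sjis; unsigned int ucs; };
static const sjis_mapping sjis_table[] = {
    { 0x8140, 0x3000 },  // IDEOGRAPHIC SPACE
    { 0x82A0, 0x3042 },  // HIRAGANA LETTER A
};

// Decode one Shift-JIS character starting at `it`; returns U+FFFD for
// anything the table does not cover.
unsigned int decode_sjis(std::string::const_iterator& it,
                         std::string::const_iterator end)
{
    unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80) return lead;   // ASCII passes through unchanged
    if (it == end) return 0xFFFD;   // truncated double-byte sequence
    unsigned short key = static_cast<unsigned short>(
        (lead << 8) | static_cast<unsigned char>(*it++));
    for (std::size_t i = 0; i < sizeof(sjis_table) / sizeof(sjis_table[0]); ++i)
        if (sjis_table[i].sjis == key) return sjis_table[i].ucs;
    return 0xFFFD;                  // unmapped pair -> replacement character
}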

Perhaps you are referring to the non-roundtrip conversion of some
Shift-JIS characters, as mentioned by Microsoft at
http://support.microsoft.com/kb/170559. But the objective is a
best-effort emulation, not a perfect one. That problem cannot be
solved by any other implementation technique anyway, so if you
convert Shift-JIS strings to Unicode manually before passing them to
Unicode-oriented functions, you are screwed in exactly the same way.

Or perhaps you mean that you can't properly encode non-Japanese
characters into the Shift-JIS encoding. In that case the character
will simply be substituted with a replacement character, or an
exception will be thrown, according to the provided policy. But the
user of such a Unicode-emulated string should only read it and not
modify it anyway; i.e. it is probably a bad idea to create a
unicode_string_adapter_builder instance of the string and pass it to
a mutation function that assumes full Unicode encoding functionality.
Then again, you are screwed the same way if you manually convert a
Unicode-encoded string to a Shift-JIS-encoded string and pass it to a
Shift-JIS-oriented function.
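
To illustrate the policy point, here is a sketch with invented policy
names (they are not the actual policy classes in Boost.Ustr), using
Latin-1 as the example target encoding:

#include <stdexcept>

// Two illustrative error policies for code points that the target
// encoding cannot represent; the adapter would select one of them
// through a template parameter.
struct replace_on_error {
    static unsigned int unencodable(unsigned int /*codepoint*/) {
        return 0x3F;  // substitute '?' (or the encoding's replacement character)
    }
};

struct throw_on_error {
    static unsigned int unencodable(unsigned int) {
        throw std::runtime_error("code point not representable in target encoding");
    }
};

// Latin-1 only covers U+0000..U+00FF; everything else goes to the policy.
template <typename ErrorPolicy>
unsigned char encode_latin1(unsigned int codepoint) {
    if (codepoint <= 0xFF) return static_cast<unsigned char>(codepoint);
    return static_cast<unsigned char>(ErrorPolicy::unencodable(codepoint));
}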

The conclusion is that Boost.Ustr with custom encoding traits is
intended as a convenience for automatically converting between two
encodings at library boundaries. It will not solve any encoding
conversion problem that you can't already solve with manual
conversion.

> Also it is likely that encoding is something that
>
> can be changed in the runtime not compile time and
> it seems that this adapter does not support such
> option.

It is always possible to add a dynamic layer on top of the static
encoding layer, but not the other way round. It shouldn't be too hard
to write a class with virtual interfaces that calls the proper
template instance of unicode_string_adapter. But that is currently
outside the scope, and the fundamental design will remain static
regardless.
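
Something like the following sketch, where every class name is
hypothetical and I assume the adapter exposes code point iteration
through begin():

#include <boost/shared_ptr.hpp>

// Hypothetical runtime wrapper: a virtual interface whose implementations
// forward to concrete unicode_string_adapter<...> instantiations.
struct codepoint_source {
    virtual ~codepoint_source() {}
    virtual unsigned int first_codepoint() const = 0;
};

template <typename Adapter>
struct codepoint_source_impl : codepoint_source {
    explicit codepoint_source_impl(const Adapter& a) : adapter(a) {}
    // Assumes the adapter exposes code point iterators through begin().
    virtual unsigned int first_codepoint() const { return *adapter.begin(); }
    Adapter adapter;
};

class any_encoded_string {
public:
    template <typename Adapter>
    explicit any_encoded_string(const Adapter& a)
        : impl_(new codepoint_source_impl<Adapter>(a)) {}

    unsigned int first_codepoint() const { return impl_->first_codepoint(); }

private:
    boost::shared_ptr<codepoint_source> impl_;
};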

> If someone uses strings with different encodings he usually
> knows their encoding...
>
> The problem is that API inconsistent as on Windows narrow
> string is some ANSI code page and anywhere else it is UTF-8.
>
> This is entirely different problem and such adapters don't
> really solve them but actually make it worse...

If I'm not wrong, however, the wide version of strings on Windows is
always UTF-16 encoded, correct?

So a manual way of constructing UTF-8 strings on Windows would be
something like:

std::wstring wide_str = L"世界你好";
std::string u8_str;
generic_conversion::u16_to_u8(wide_str.begin(), wide_str.end(),
                              std::back_inserter(u8_str));

except that this is not portable to Unix systems. But with
Boost.Ustr you can achieve the same thing with

unicode_string_adapter<std::string> u8_str = USTR("世界你好");

which gets expanded into

unicode_string_adapter<std::string> u8_str =
    unicode_string_adapter<std::wstring>( std::wstring(L"世界你好") );

So while it's hard to construct UTF-8 string literals on Windows, it
is still possible to write code that manually inserts UTF-8 code
units into std::string. After all, std::string places no restriction
on which bytes we can insert into it.
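
For instance, the three UTF-8 code units of U+4E16 (the first
character of the literal above) can be pushed in directly:

std::string u8_str;
u8_str.push_back('\xE4');  // UTF-8 code units of U+4E16
u8_str.push_back('\xB8');
u8_str.push_back('\x96');
// u8_str now holds a valid UTF-8 sequence even though std::string itself
// knows nothing about the encoding.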

Perhaps you are talking about printing the string to std::cout, but
that's another hard problem that we have to tackle separately.

> Other problem is
> ================
>
> I don't believe that string adapter would solve any real problems
> because:
>
> a) If you iterate over code points you are very likely do something
> wrong. As code point != character and this is very common mistake.

I am well aware of that, but I decided to separate the concerns into
different layers and tackle the abstract character problem at a
higher layer. The unicode_string_adapter class has a well-defined
role, which is to offer *code point* level access. I did plan to
write another class that does the character iteration, but I haven't
had enough time to do it yet.

But I believe you can pass the code point iterators to Mathias'
Boost.Unicode methods to create abstract character iterators that do
the job you want.

> b) If you want to iterate over code points it is better to have some
> kind of utf_iterator that receives a range and iterate over it,
> it would be more generic and do not require to have an additional
> class.
>
> For example Boost.Locale has utf_traits that allow to implement
> iteration over code points quite easily.
>
> See:
> http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/namespaceboost_1_1locale_1_1utf.html
> http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/structboost_1_1locale_1_1utf_1_1utf__traits.html
>
> And you don't need any kind of specific adapters.

I added a static method, unicode_string_adapter::make_codepoint_iterator,
to do what you have requested. It accepts three code unit iterator
parameters, current, begin, and end, so that it can iterate in both
directions without going out of bounds. I hope that is what you are
looking for.
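
A usage sketch based on that description (I'm deducing the return
type with auto rather than quoting the exact signature from the
docs):

// Hypothetical usage; make_codepoint_iterator(current, begin, end) is the
// method described above, with the return type deliberately left deduced.
std::string raw = "hello";
auto it = unicode_string_adapter<std::string>::make_codepoint_iterator(
    raw.begin(), raw.begin(), raw.end());
// `it` yields code points and can move forward or backward without
// running past raw.begin() or raw.end().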

> c) The problem in Boost is not missing Unicode String and it is not
> even required to have yet-another-unicode-string that we have
> good Unicode support.
>
> The problem is policy the problem is Boost just can't decide once
> and forever that std::string is UTF-8...
>
> But don't get me wrong. This is My Opinion, many
> would disagree with me.
>
> Bottom line,
>
> Unicode strings, cool string adapters, UTF-iterators
> and even Boost.Unicode and Boost.Locale would not solve
> the problems that Boost libraries use inconsistent
> encodings on different platforms.
>
> IMHO: the only way to solve it is POLICY.

It is fine if you disagree with my approach, but keep in mind that
the focus of this thread is simply to determine whether my currently
proposed solution is good, not to propose a better solution.

Thanks.

cheers,

Soares

