Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
From: Artyom Beilis (artyomtnk_at_[hidden])
Date: 2011-08-11 07:03:45
>From: Soares Chen Ruo Fei <crf_at_[hidden]>
>To: boost_at_[hidden]
>Sent: Tuesday, August 9, 2011 10:53 AM
>Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
>
>My post has probably slipped through the radar so I'm just going to
>bump this post again. Please feel free to criticize if you think that
>my library has any fundamental design flaw.
>As a student and GSoC
>participant, I think the most important thing for me is to learn
>what I did wrong in the project so that I will not repeat the same
>mistake, and also to allow me to gain enough experience so that I can
>really give useful contribution to the open source community in
>future.
>
>Any feedback is really much appreciated. Thanks.
>
Hello,
First of all, I should say that, as the author of the Boost.Locale
library, I have very strong opinions on how strings and Unicode should
be handled.
My strong opinion is:
a. Strings should be just a container object with a default encoding
and some useful API to handle it.
b. The default encoding MUST be UTF-8.
c. There are several ways to implement strings: COW, mutable, immutable,
with small string optimization, and so on. One way or another,
std::string is the de facto string, and I think we should live with
it and use alternative containers where it matters.
d. Code points and code units are meaningless unless you are developing
some Unicode algorithm, and you don't do that; you use one written
by experts.
So my biggest problem is motivation:
-----------------------------------
> The main reason that Boost.Ustr is developed is because current
> raw string types such as std::string requires developers to make
> assumption on the encoding of the string content, such as UTF-8
> for std::string. This creates inconsistency when a string passed
> to library APIs has different encoding from the library expects.
Ustr does not solve this problem, because it does not really
provide something like
adapter<generic encoding> {
string content
}
Something like that may be useful, but not in
this case. Basically, your library provides a wrapper
around a string and outputs Unicode code points, but it
does so for UTF encodings only!
That is not much of a benefit. You provide encoding traits,
but they are basically meaningless for the purpose you have given,
because:
It does not provide traits for non-Unicode encodings
like, say, Shift-JIS or ISO-8859-8.
BTW, you can't create traits for many encodings. For
example, you can't implement the traits requirements at
http://crf.scriptmatrix.net/ustr/ustr/advanced.html#ustr.advanced.custom_encoding_traits
for popular encodings like Shift-JIS or GBK...
Homework: tell me why ;-)
Also, the encoding is likely to be something that
can change at run time, not at compile time, and
it seems that this adapter does not support that
option.
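To make the run-time point concrete, here is a minimal sketch, under
stated assumptions, of text whose encoding is only known while the
program runs (for example, read from a header or a config file). The
names tagged_text and to_utf8 are hypothetical and are not part of
Boost.Ustr or Boost.Locale; only UTF-8 and ISO-8859-1 are handled, purely
to keep the example short.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical type: raw bytes plus an encoding name chosen at run time.
struct tagged_text {
    std::string encoding;  // e.g. "UTF-8" or "ISO-8859-1"
    std::string bytes;     // the undecoded content
};

// Hypothetical helper: normalize to UTF-8 by dispatching on the run-time name.
// A compile-time encoding trait cannot express this choice.
std::string to_utf8(const tagged_text& t)
{
    if (t.encoding == "UTF-8")
        return t.bytes;  // already in the target encoding

    if (t.encoding == "ISO-8859-1") {
        // Each Latin-1 byte is the code point of the same value.
        std::string out;
        for (std::string::size_type i = 0; i < t.bytes.size(); ++i) {
            const unsigned char c = static_cast<unsigned char>(t.bytes[i]);
            if (c < 0x80) {
                out += static_cast<char>(c);
            } else {
                out += static_cast<char>(0xC0 | (c >> 6));    // lead byte
                out += static_cast<char>(0x80 | (c & 0x3F));  // trail byte
            }
        }
        return out;
    }

    throw std::invalid_argument("unsupported encoding: " + t.encoding);
}
```

In real code the hand-written branch would presumably be replaced by a
general converter that takes the charset name as a run-time string, which
is the kind of interface a compile-time adapter does not offer.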
> The problem mainly arise because there are a small minority of
> developers who use different encoding for the same string type.
If someone uses strings with different encodings, he usually
knows their encoding...
The problem is that the API is inconsistent: on Windows a narrow
string is in some ANSI code page, and everywhere else it is UTF-8.
This is an entirely different problem, and such adapters don't
really solve it; they actually make it worse...
The other problem
=================
I don't believe that string adapter would solve any real problems
because:
a) If you iterate over code points, you are very likely doing something
wrong, because code point != character; this is a very common mistake.
b) If you want to iterate over code points, it is better to have some
kind of utf_iterator that receives a range and iterates over it.
That would be more generic and would not require an additional
class.
For example, Boost.Locale has utf_traits, which makes it quite easy
to implement iteration over code points.
See:
http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/namespaceboost_1_1locale_1_1utf.html
http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/structboost_1_1locale_1_1utf_1_1utf__traits.html
And you don't need any kind of specific adapters.
c) The problem in Boost is not a missing Unicode string, and we do not
even need yet-another-unicode-string in order to have
good Unicode support.
The problem is policy: Boost just can't decide once
and for all that std::string is UTF-8...
But don't get me wrong. This is My Opinion, many
would disagree with me.
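Points (a) and (b) above can be illustrated with a small sketch: a free
function that decodes UTF-8 code points from any iterator range, with no
dedicated string class. This is a standalone illustration and not the
Boost.Locale implementation; the names decode_utf8, code_points and
illegal_cp are hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sentinel returned for malformed or truncated input.
const std::uint32_t illegal_cp = 0xFFFFFFFFu;

// Decode one code point from [p, e) and advance p past the bytes consumed.
template <typename Iterator>
std::uint32_t decode_utf8(Iterator& p, Iterator e)
{
    if (p == e) return illegal_cp;
    const unsigned char lead = static_cast<unsigned char>(*p);
    ++p;
    if (lead < 0x80) return lead;  // ASCII, one byte

    int trail = 0;                 // number of continuation bytes expected
    std::uint32_t cp = 0;
    if (lead >= 0xC2 && lead <= 0xDF)      { trail = 1; cp = lead & 0x1Fu; }
    else if (lead >= 0xE0 && lead <= 0xEF) { trail = 2; cp = lead & 0x0Fu; }
    else if (lead >= 0xF0 && lead <= 0xF4) { trail = 3; cp = lead & 0x07u; }
    else return illegal_cp;        // stray continuation or invalid lead byte

    for (int i = 0; i < trail; ++i) {
        if (p == e) return illegal_cp;  // truncated sequence
        const unsigned char c = static_cast<unsigned char>(*p);
        if ((c & 0xC0) != 0x80) return illegal_cp;
        ++p;
        cp = (cp << 6) | (c & 0x3Fu);
    }

    // Reject overlong forms, surrogates, and values above U+10FFFF.
    static const std::uint32_t min_value[4] = {0, 0x80, 0x800, 0x10000};
    if (cp < min_value[trail] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return illegal_cp;
    return cp;
}

// Collect the code points of any range of chars (std::string, std::vector<char>, ...).
template <typename Range>
std::vector<std::uint32_t> code_points(const Range& r)
{
    std::vector<std::uint32_t> out;
    auto p = r.begin();
    const auto e = r.end();
    while (p != e)
        out.push_back(decode_utf8(p, e));  // always consumes at least one byte
    return out;
}
```

Calling code_points(std::string("e\xCC\x81")) yields two code points,
U+0065 and U+0301 (combining acute accent), although a reader sees a
single character. That is the "code point != character" trap: even a
correct code point iterator does not give you characters.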
=================================
Bottom line,
Unicode strings, cool string adapters, UTF iterators,
and even Boost.Unicode and Boost.Locale will not solve
the problem that Boost libraries use inconsistent
encodings on different platforms.
IMHO: the only way to solve it is POLICY.
Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.sf.net/
CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/