Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
From: Artyom Beilis (artyomtnk_at_[hidden])
Date: 2011-08-11 07:03:45


>From: Soares Chen Ruo Fei <crf_at_[hidden]>
>To: boost_at_[hidden]
>Sent: Tuesday, August 9, 2011 10:53 AM
>Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
>
>My post has probably slipped through the radar so I'm just going to
>bump this post again. Please feel free to criticize if you think that
>my library has any fundamental design flaw.
>
>As a student and GSoC
>participant, I think the most important thing is for me is to learn
>what I did wrong in the project so that I will not repeat the same
>mistake, and also to allow me to gain enough experience so that I can
>really give useful contribution to the open source community in
>future.
>
>Any feedback is really much appreciated. Thanks.
>

Hello,

First of all, I should say that, as the author of the Boost.Locale library,
I have very strong opinions on how strings and Unicode should be handled.

My strong opinion is:

a. A string should be just a container object with a default encoding
   and some useful API to handle it.
b. The default encoding MUST be UTF-8.
c. There are several ways to implement strings: COW, mutable, immutable,
   with small string optimization and so on. One way or another,
   std::string is the de-facto string, and I think we should live with
   it and use alternative containers where it matters.
d. Code points and code units are meaningless unless you develop
   some Unicode algorithm - and you don't - you use one written
   by experts.

So my biggest problem is motivation:
------------------------------------

> The main reason that Boost.Ustr is developed is because current
> raw string types such as std::string requires developers to make
> assumption on the encoding of the string content, such as UTF-8
> for std::string. This creates inconsistency when a string passed
> to library APIs has different encoding from the library expects.

Ustr does not solve this problem, because it does not really provide
something like

  adapter<generic encoding> {
    string content
  }

That kind of thing might be useful, but it is not what you have here.

Basically, your library provides a wrapper around a string and outputs
Unicode code points, but it does so for UTF encodings only!
That is not much of a benefit.

You provide encoding traits, but they are basically meaningless for the
purpose you gave, because there are no traits for non-Unicode encodings
such as, say, Shift-JIS or ISO-8859-8.

BTW, you can't create traits for many encodings. For example, you can't
implement the traits requirements:

http://crf.scriptmatrix.net/ustr/ustr/advanced.html#ustr.advanced.custom_encoding_traits

for popular encodings like Shift-JIS or GBK...

Homework: tell me why ;-)

Also, the encoding is likely to be something that can change at run time,
not at compile time, and it seems that this adapter does not support
that option.

> The problem mainly arise because there are a small minority of
> developers who use different encoding for the same string type.

If someone uses strings with different encodings, he usually knows their
encoding...

The problem is that the API is inconsistent: on Windows a narrow string is
in some ANSI code page, and everywhere else it is UTF-8.

This is an entirely different problem, and such adapters don't really
solve it but actually make it worse...

The other problem
=================

I don't believe that a string adapter would solve any real problems because:

   a) If you iterate over code points you are very likely doing something
      wrong, as code point != character. This is a very common mistake.

   b) If you want to iterate over code points, it is better to have some
      kind of utf_iterator that receives a range and iterates over it.
      It would be more generic and would not require an additional
      class.

      For example, Boost.Locale has utf_traits, which makes it quite easy
      to implement iteration over code points.

      See:

       http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/namespaceboost_1_1locale_1_1utf.html
       http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/structboost_1_1locale_1_1utf_1_1utf__traits.html

      And you don't need any kind of specific adapter.

   c) The problem in Boost is not a missing Unicode string, and we do not
      even need yet-another-unicode-string in order to have
      good Unicode support.

      The problem is policy: Boost just can't decide once
      and for all that std::string is UTF-8...

But don't get me wrong. This is My Opinion; many would disagree with me.

=================================

Bottom line: Unicode strings, cool string adapters, UTF-iterators and even
Boost.Unicode and Boost.Locale will not solve the problem that Boost
libraries use inconsistent encodings on different platforms.

IMHO: the only way to solve it is POLICY.

Artyom Beilis
--------------
CppCMS - C++ Web Framework:   http://cppcms.sf.net/
CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/
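
To make points (a) and (b) above concrete, here is a minimal sketch of
iterating over code points with a free function that takes a plain range,
with no string adapter class involved. It is hand-written for illustration
and is not the Boost.Locale utf_traits API or anything from Boost.Ustr;
the names decode_utf8, code_points and kIllegal are hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sentinel for malformed or truncated input (an assumption of
// this sketch, not a Boost constant).
constexpr std::uint32_t kIllegal = 0xFFFFFFFFu;

// Decode one UTF-8 code point from [it, end) and advance `it`.
// Always consumes at least one byte, so callers cannot loop forever.
template <typename Iterator>
std::uint32_t decode_utf8(Iterator& it, Iterator end) {
    if (it == end) return kIllegal;
    unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80) return lead;  // single-byte (ASCII) sequence

    int trail = 0;
    std::uint32_t cp = 0;
    if (lead >= 0xC2 && lead <= 0xDF)      { trail = 1; cp = lead & 0x1F; }
    else if (lead >= 0xE0 && lead <= 0xEF) { trail = 2; cp = lead & 0x0F; }
    else if (lead >= 0xF0 && lead <= 0xF4) { trail = 3; cp = lead & 0x07; }
    else return kIllegal;  // stray continuation byte or invalid lead byte

    for (int i = 0; i < trail; ++i) {
        if (it == end) return kIllegal;  // truncated sequence
        unsigned char c = static_cast<unsigned char>(*it);
        if ((c & 0xC0) != 0x80) return kIllegal;  // not a continuation byte
        ++it;
        cp = (cp << 6) | (c & 0x3F);
    }

    // Reject overlong forms, surrogates and values above U+10FFFF.
    static const std::uint32_t min_for[] = {0, 0x80, 0x800, 0x10000};
    if (cp < min_for[trail] || cp > 0x10FFFF ||
        (cp >= 0xD800 && cp <= 0xDFFF)) {
        return kIllegal;
    }
    return cp;
}

// Collect the code points of any range of bytes (std::string,
// std::vector<char>, ...).
template <typename Range>
std::vector<std::uint32_t> code_points(const Range& r) {
    std::vector<std::uint32_t> out;
    auto it = r.begin();
    auto end = r.end();
    while (it != end) out.push_back(decode_utf8(it, end));
    return out;
}
```

For the string "e\xCC\x81" (the letter e followed by U+0301 COMBINING ACUTE
ACCENT), code_points() yields two code points from three bytes, yet a user
sees a single character. That is the "code point != character" trap:
counting or splitting by code point is still wrong for text-level
operations, which need a real Unicode algorithm written by experts.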

