Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
From: Artyom Beilis (artyomtnk_at_[hidden])
Date: 2011-08-14 06:37:02
----- Original Message -----
> From: Soares Chen Ruo Fei <crf_at_[hidden]>
>> The problem is that, unlike variable-width UTF
>> encodings, which have a clear separation between
>> lead and trail code units, multi-byte encodings
>> like Shift-JIS or GBK have no such separation.
>>
>> The UTF-8 and UTF-16 encodings are so-called
>> self-synchronizing: you can go forward and
>> backward without any problem, and even if you
>> lose your position you can find the next valid
>> position in either direction and continue.
>>
>> However, with non-Unicode CJK encodings
>> like Shift-JIS or GBK there is no way
>> to go backward, because trail bytes are
>> ambiguous; to decode the text you must
>> always go forward.
>>
>> That is why the traits model you provided
>> has a conceptual flaw: it is impossible
>> to implement a bidirectional iterator
>> over most non-UTF CJK multi-byte encodings.
>
> Ahh I see, so that's quite nasty, but it can actually still be done
> at a sacrifice in efficiency. Since the iterator already has the
> begin and end boundary iterators, it can simply reiterate from the
> beginning of the string. Although doing so is roughly O(N^2), it
> shouldn't have a significant impact, as developers rarely use these
> multi-byte encodings and even more seldom use reverse decoding
Except that these are the default narrow-string encodings on Windows
(and sometimes even on Linux) in China, Japan and Korea...
No, they are not rare encodings.
A better solution would be to create an index of
Shift-JIS position -> code point and use it; you can
probably build it lazily, on the first attempt
at backward iteration.
This is what I do in Boost.Locale.
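A minimal sketch of that lazy index, assuming a simplified one/two-byte
model of Shift-JIS (the class and function names are illustrative, not
what Boost.Locale actually does):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Illustrative sketch: record the byte offset of every code-point start
// by a single forward scan, built lazily on the first backward step, so
// that operator-- becomes a lookup instead of an ambiguous backward
// byte scan.
class sjis_index {
public:
    explicit sjis_index(std::string text) : text_(std::move(text)), built_(false) {}

    // Offset of the code point that precedes byte offset `pos`
    // (returns 0 when already at the start).
    std::size_t prev_start(std::size_t pos) {
        build();  // lazy: pay for the scan only if we ever go backward
        std::size_t prev = 0;
        for (std::size_t i = 0; i < starts_.size() && starts_[i] < pos; ++i)
            prev = starts_[i];
        return prev;
    }

    std::size_t count() { build(); return starts_.size(); }

private:
    // Shift-JIS lead bytes occupy 0x81-0x9F and 0xE0-0xFC.
    static bool is_lead(unsigned char c) {
        return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
    }

    void build() {
        if (built_) return;
        for (std::size_t i = 0; i < text_.size(); ) {
            starts_.push_back(i);
            i += is_lead(static_cast<unsigned char>(text_[i])) ? 2 : 1;
        }
        built_ = true;
    }

    std::string text_;
    std::vector<std::size_t> starts_;
    bool built_;
};
```

For example, "日本A" in Shift-JIS is the bytes 93 FA 96 7B 41, so the
index holds the offsets 0, 2 and 4, and stepping backward from the end
lands on offset 2 in O(1) after the one-time scan.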
>>> It is always possible to add a dynamic layer on top of the
>>> static
>>> encoding layer but not the other way round. It shouldn't be
>>> too hard
>>
>> Is the encoding traits object a member of unicode_string_adapter?
>> From what I saw in the code it is not - correct me if I am wrong.
>>
>> So in the current situation you can't make dynamic encoding traits.
>
> No, the encoding traits are static according to my design. But what I
> mean by dynamic encoding is something different from what you
> thought. See
> https://github.com/crf00/boost.ustr/blob/master/boost/ustr/dynamic_unicode_string.hpp
> which is the dynamic-encoding string class that I have just written.
> I haven't included the full functionality, but you can get the basic
> idea of my dynamic string from there.
>
Ok I see this code:

class dynamic_codepoint_iterator_object
    : public std::iterator<std::bidirectional_iterator_tag, codepoint_type>
{
public:
    virtual const codepoint_type dereference() const = 0;
    virtual void increment() const = 0;
    virtual void decrement() const = 0;
    virtual bool equals(const dynamic_codepoint_iterator_object* other) const = 0;
    virtual const unicode_string_type& get_type() const = 0;
    virtual void* get_raw_iterator() const = 0;
    virtual ~dynamic_codepoint_iterator_object() { }
};
THIS IS VERY BAD DESIGN.
------------------
I've been there.
Think of

    while(pos != end) {
        code_point = *pos;
        ++pos;
    }

How many virtual calls are required per code point?
1. equals
2. dereference
3. increment
This is a horrible way to do things.
I've started working on a generic "abstract iterator"
for Boost.Locale but haven't completed the work
yet. It reduces the number of virtual calls
per character to below one.
This is not the way to go.
(BTW, you forgot a clone() member function.)
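The idea, as a hypothetical sketch (the interface and names are
illustrative, not the actual Boost.Locale code): one virtual call fills
a whole buffer of code points, so the dispatch cost is amortized over
the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <stdint.h>

typedef uint32_t codepoint_type;  // stand-in for the library's type

// One virtual call decodes up to `n` code points at once, instead of
// three virtual calls (equals/dereference/increment) per code point.
class abstract_codepoint_source {
public:
    virtual ~abstract_codepoint_source() {}
    // Decode up to `n` code points into `out`; return how many were
    // produced, 0 at end of input.
    virtual std::size_t fetch(codepoint_type* out, std::size_t n) = 0;
};

// Toy implementation over an ASCII string, just to show the shape.
class ascii_source : public abstract_codepoint_source {
public:
    explicit ascii_source(const char* s) : p_(s) {}
    virtual std::size_t fetch(codepoint_type* out, std::size_t n) {
        std::size_t i = 0;
        while (i < n && *p_)
            out[i++] = static_cast<codepoint_type>(*p_++);
        return i;
    }
private:
    const char* p_;
};

// The hot loop now performs one virtual call per 64 code points.
inline std::size_t count_codepoints(abstract_codepoint_source& src) {
    codepoint_type buf[64];
    std::size_t total = 0, got;
    while ((got = src.fetch(buf, 64)) != 0)
        total += got;
    return total;
}
```

With a buffer of 64, a megabyte of ASCII costs roughly 1/64 of a
virtual call per character rather than three.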
>>> to write a class with virtual interfaces to call the proper
>>> template
>>> instance of unicode_string_adapter. But that is currently
>>> outside of
>>> the scope and the fundamental design will still be static
>>> regardless.
>>>
>>
>> This is a problem, as for example the typical use case where the
>> ANSI code page is used as the default encoding is something that
>> is defined by the OS the program runs on.
>
> Basically Boost.Ustr is designed to be completely locale-agnostic,
> so it does not try to play well with locale rules. As I said above,
> the dynamic-encoding string is probably the feature you want, but I
> actually think the problem you mention can still be solved using
> purely static encoding. It can be something like:
>
> unicode_string_adapter<std::string>
> get_string_from_locale_sensitive_system() {
>     const char* raw_string = get_locale_dependent_system_string();
>
>     CodePage codepage = get_system_codepage();
>     if(codepage == CodePage::UTF8_CodePage) {
>         return unicode_string_adapter<std::string>(raw_string);
>     } else if(codepage == CodePage::932_CodePage) {
>         return unicode_string_adapter<std::string, ...,
>             ShiftJisEncoder, ...>(raw_string);
>     } else if(codepage == CodePage::950_CodePage) {
>         return unicode_string_adapter<std::string, ...,
>             Big5Encoder, ...>(raw_string);
>     }
> }
>
The entire motivation behind this library was to provide
some "Unicode wrapping" over different encodings.
Telling me that the library is "locale-agnostic" makes
it unsuitable for the proposed motivation, because
non-Unicode encodings do change at run time.
But forget locale: the motivation requires supporting
at least the conversion from the runtime OS ANSI code page
to Unicode, which the library does not.
This is the biggest flaw of the current library.
> It is probably not a good idea to pass a string encoded in an
> uncommon encoding and let it slip through the entire system, even
> with the proper encoding tag. Such a design would eventually still
> lead to bugs, even with the best possible help from Unicode
> utilities. The better idea is to make use of the automatic
> conversion and convert the string back to a UTF-8 string as soon as
> the string in the uncommon encoding is no longer needed.
>
Or maybe just convert the "uncommon encoding" to UTF-8/UTF-16/UTF-32
at the boundary and forget the whole wrapper?
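Such a boundary conversion is a single small function; here is a sketch
using POSIX iconv(3) (glibc-flavoured signature, error handling
abbreviated, and the function name is just illustrative):

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Convert once at the system boundary, then pass plain UTF-8
// std::string around internally; no per-string encoding tag needed.
std::string to_utf8(const std::string& in, const char* from_charset) {
    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("unsupported charset");

    std::string out;
    char buf[256];
    char* inp = const_cast<char*>(in.data());  // glibc wants char**
    std::size_t inleft = in.size();
    while (inleft) {
        char* outp = buf;
        std::size_t outleft = sizeof buf;
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (std::size_t)-1
            && errno != E2BIG) {  // E2BIG just means "output buffer full"
            iconv_close(cd);
            throw std::runtime_error("invalid input sequence");
        }
        out.append(buf, outp - buf);
    }
    iconv_close(cd);
    return out;
}
```

For example, the Shift-JIS bytes 93 FA ("日") come back as the UTF-8
bytes E6 97 A5, after which the rest of the program never needs to know
the source encoding existed.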
=====================================================================
Dear Soares Chen Ruo Fei,
(Sorry, I have no idea which part is the first name :-) )
Don't get me wrong: I see what you are trying to do, and it is
a good thing.
There is one big problem:
you have entered a very, very dangerous swamp.
It is very easy to make wrong assumptions, not because you don't
try to do your best, but because even such a small problem is very
diverse. And the "best" you can do is to assume that if something
can go wrong, it will.
It is not an accident that there is no "Unicode string"
in Boost, while there are so many "fancy" Unicode
strings around, like QString, icu::UnicodeString, gtk::ustring
and many others, that somehow fail to provide the goodies
you need.
====================================================================
Best Regards,
Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.sf.net/
CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk