Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
From: Soares Chen Ruo Fei (crf_at_[hidden])
Date: 2011-08-13 15:42:28


On Sat, Aug 13, Artyom Beilis wrote:
> I understand, and I do not expect you to implement
> every encoding, but some encodings, let's say Latin1 or ASCII,
> should be provided to at least serve as an example.

That's an easier task. :) I'll try to find time to implement encoding
traits for ASCII and probably URL-encoded strings, but I'm not sure
whether I can finish them before the GSoC deadline next week.

> The problem is that, unlike the variable-width UTF encodings,
> which have a clear separation between the lead and
> the trail code units, multi-byte encodings
> like Shift-JIS or GBK have no such separation.
>
> UTF-8 and UTF-16 encodings are so called
> self synchronizing, you can go forward,
> you can go backward without any problem
> and even if you lost the position you
> can find the next valid position in either
> direction and continue.
>
> However with non-Unicode CJK encodings
> like Shift-JIS or GBK there is no
> way to go backward because it is ambiguous,
> and in order to decode text you always
> should go forward.
>
> That is why the traits model you provided
> has a conceptual flaw: it is impossible
> to implement a bidirectional iterator
> over most non-UTF CJK multi-byte encodings.

Ahh, I see, so that's quite nasty, but it can still be done at the
cost of some efficiency. Since the iterator already holds the begin
and end boundary iterators, it can simply decode forward again from
the beginning of the string to find the previous character. Each
backward step then costs O(N), so a full reverse traversal is roughly
O(N^2), but that shouldn't have a significant impact because
developers rarely use these multi-byte encodings and use reverse
decoding even more rarely. We can also look for cases where the
previous character position can be determined without going back to
the beginning; if those cases turn out to be the norm, the penalty
becomes even less significant.
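
As a rough sketch of the re-scan idea (this is not code from
Boost.Ustr; is_sjis_lead and sjis_prev are made-up names, and the
lead-byte ranges are a simplified assumption about Shift-JIS), going
backward could look something like this:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified Shift-JIS lead-byte test. Assumption: bytes 0x81-0x9F and
// 0xE0-0xFC start a two-byte character, everything else is one byte.
inline bool is_sjis_lead(unsigned char c) {
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

// Hypothetical helper: returns the offset where the character ending just
// before `pos` starts. `pos` must be a character boundary greater than 0.
// A trail byte can look like ASCII, so the only safe way back is to
// decode forward from the beginning, which is O(N) per call.
inline std::size_t sjis_prev(const std::string& s, std::size_t pos) {
    std::size_t cur = 0;
    std::size_t prev = 0;
    while (cur < pos) {
        prev = cur;
        const unsigned char c = static_cast<unsigned char>(s[cur]);
        cur += (is_sjis_lead(c) && cur + 1 < s.size()) ? 2 : 1;
    }
    return prev;
}
```

For the bytes 'A', 0x83, 0x41, 'B' the second character is the
two-byte sequence 0x83 0x41, whose trail byte equals 'A'. Looking at
the byte before offset 3 alone cannot tell whether it is a character
of its own, which is exactly the ambiguity you describe.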

>> It is always possible to add a dynamic layer on top of the static
>> encoding layer but not the other way round. It shouldn't be too hard
>
> Is the encoding traits object a member of unicode_string_adapter?
> From what I have seen in the code it is not - correct me if I am wrong.
>
> So in the current situation you can't make dynamic encoding traits.

No, the encoding traits are static by design. But what I mean by
dynamic encoding is something different from what you have in mind.
See https://github.com/crf00/boost.ustr/blob/master/boost/ustr/dynamic_unicode_string.hpp
which is the dynamic encoding string class that I have just written. I
haven't included the full functionality yet, but you can get the basic
idea of my dynamic string from there.
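
To illustrate the general idea of a dynamic layer on top of static
encoders, here is a minimal sketch. It is not the code in
dynamic_unicode_string.hpp; dynamic_string, latin1_decoder and
utf8_decoder are hypothetical names, and the UTF-8 decoder assumes
well-formed input:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical static decoders: the encoding is fixed at compile time.
struct latin1_decoder {
    static std::u32string decode(const std::string& s) {
        std::u32string out;
        for (unsigned char c : s) out.push_back(c);  // one byte, one code point
        return out;
    }
};

struct utf8_decoder {
    // Minimal decoder for well-formed UTF-8 only (no validation).
    static std::u32string decode(const std::string& s) {
        std::u32string out;
        for (std::size_t i = 0; i < s.size();) {
            const unsigned char c = static_cast<unsigned char>(s[i]);
            const int len = c < 0x80 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
            char32_t cp = len == 1 ? c : c & (0xFF >> (len + 1));
            for (int k = 1; k < len && i + k < s.size(); ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
            out.push_back(cp);
            i += len;
        }
        return out;
    }
};

// Dynamic layer: a virtual interface hides which static instance is used.
class dynamic_string {
    struct concept_t {
        virtual ~concept_t() {}
        virtual std::u32string code_points() const = 0;
    };
    template <typename Decoder>
    struct model_t : concept_t {
        explicit model_t(std::string s) : raw(std::move(s)) {}
        std::u32string code_points() const override { return Decoder::decode(raw); }
        std::string raw;
    };
    std::shared_ptr<const concept_t> impl_;
    explicit dynamic_string(std::shared_ptr<const concept_t> p) : impl_(std::move(p)) {}

public:
    // The encoding is chosen here, once; callers afterwards see one type.
    template <typename Decoder>
    static dynamic_string make(std::string raw) {
        return dynamic_string(std::make_shared<model_t<Decoder> >(std::move(raw)));
    }
    std::u32string code_points() const { return impl_->code_points(); }
};
```

With this, dynamic_string::make<latin1_decoder>("\xE9") and
dynamic_string::make<utf8_decoder>("\xC3\xA9") have the same type and
both yield the single code point U+00E9, while the decoders
themselves stay fully static.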

>> to write a class with virtual interfaces to call the proper template
>> instance of unicode_string_adapter. But that is currently outside of
>> the scope and the fundamental design will still be static regardless.
>>
>>
>
> This is a problem because, for example, in the typical use case
> where the ANSI code page is the default encoding, the encoding
> is defined by the OS the program runs on.

Basically, Boost.Ustr is designed to be completely locale-agnostic, so
it does not try to follow locale rules. As I said above, the dynamic
encoding string is probably the feature you want, but I think the
problem you mention can still be solved using purely static encodings.
It can be something like:

unicode_string_adapter<std::string> get_string_from_locale_sensitive_system() {
    const char* raw_string = get_locale_dependent_system_string();

    CodePage codepage = get_system_codepage();
    if(codepage == CodePage::UTF8_CodePage) {
        return unicode_string_adapter<std::string>(raw_string);
    } else if(codepage == CodePage::CodePage_932) {
        // Shift-JIS; relies on automatic conversion to the UTF-8 return type.
        return unicode_string_adapter<std::string, ..., ShiftJisEncoder, ...>(raw_string);
    } else if(codepage == CodePage::CodePage_950) {
        // Big5; converted the same way.
        return unicode_string_adapter<std::string, ..., Big5Encoder, ...>(raw_string);
    }
    // Any other code page is not handled in this sketch.
    throw std::runtime_error("unsupported code page");
}

It is probably not a good idea to pass a string in an uncommon
encoding through the entire system, even with the proper encoding tag.
Such a design would eventually lead to bugs even with the best
possible help from Unicode utilities. The better idea is to make use
of the automatic conversion and convert the string back to UTF-8 as
soon as the uncommon encoding is no longer needed.

> Yes, but wide characters are useless for cross-platform development,
> as they are UTF-16 only on Windows; on other OSes they are UTF-32.

Since unicode_string_adapter uses template metaprogramming to choose
either the UTF-16 or the UTF-32 encoding for wchar_t strings, using
wchar_t together with Boost.Ustr should be much less painful IMHO, and
a wchar_t string can always be converted to a UTF-8 string with one
extra line of code.
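
The compile-time choice can be sketched roughly like this. It is only
an illustration of the technique; encoder_for_size and the tag types
are hypothetical names, not the actual Boost.Ustr internals:

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical tags standing in for the real encoder types.
struct utf16_encoder_tag {};
struct utf32_encoder_tag {};

// Primary template is left undefined so that an unexpected code unit
// size becomes a compile error instead of a silent wrong choice.
template <std::size_t CharSize> struct encoder_for_size;
template <> struct encoder_for_size<2> { typedef utf16_encoder_tag type; };
template <> struct encoder_for_size<4> { typedef utf32_encoder_tag type; };

// Resolves to UTF-16 where wchar_t is 2 bytes (e.g. Windows) and to
// UTF-32 where it is 4 bytes (e.g. most Unix-like systems).
typedef encoder_for_size<sizeof(wchar_t)>::type wchar_encoder_tag;
```

User code only ever names wchar_encoder_tag (or lets the adapter pick
it), so the same source compiles on both kinds of platform.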

>> So while it's hard to construct UTF-8 string literals on Windows, it
>> should still be possible by writing code that manually insert UTF-8
>> code units into std::string. After all std::string does not have any
>> restriction on which bytes we can manually insert into it.
>>
>>
>
> Sorry but IMHO this is quite ugly solution...

The implementation may be ugly, but it is already encapsulated. The
main issue is whether it makes the *user* code uglier or more elegant.
Most users would not care about the extra overhead of converting short
string literals at run time, and IMHO saving developers' time is worth
more than avoiding that run-time overhead.
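
For illustration, the run-time construction could be as small as the
following sketch. append_utf8 and utf8_from are hypothetical helpers,
not part of Boost.Ustr, and the sketch assumes a compiler with C++11
char32_t literals and valid Unicode scalar values as input:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: appends the UTF-8 code units of `cp` to `out`.
// Assumes `cp` is a valid scalar value (no surrogate or range checks).
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Builds a UTF-8 std::string at run time from a UTF-32 literal, so the
// result does not depend on the compiler's narrow execution charset.
inline std::string utf8_from(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) append_utf8(out, cp);
    return out;
}
```

The user then writes utf8_from(U"\u4E16") and gets the three bytes
E4 B8 96 in a plain std::string, without touching the byte-level code.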

>> Perhaps you are talking about printing the string out to std::cout,
>> but that's another hard problem that we have to tackle separately.
>
>
> Actually this is quite simple even on Windows.
>
> In console write chcp 65001 and then UTF-8 output
> will be shown correctly.
>
>> The unicode_string_adapter class has a well defined role which
>> is to offer *code point* level access.
>
> No problem with that but shouldn't it be maybe
> code_point iterator and not "String-Adaptor"?

Feel free to suggest a better name; the current name
"unicode_string_adapter" is a bit too long anyway. But the code point
iterator is just a small part of the whole design. There are many more
design issues in making the adapter work closely with the original
string than just producing a code point iterator.

>> I did plan to write another class that does the character
>> iteration, but I don't have enough time to do it yet.
>>
>>
>
> If you implement character iteration it will not be
> compact library as character iteration
> requires access to Unicode database of
> character properties.
>
> And actually Boost.Locale already provides character
> iteration.
>
> http://svn.boost.org/svn/boost/trunk/libs/locale/doc/html/namespaceboost_1_1locale_1_1boundary.html
>
> And Boost.Unicode should provide grapheme segmentation
> with different interface.

Yup. That's why this is not a high-priority task for me. I try to
implement just the string-related Unicode functions and leave the rest
to Boost.Unicode and Boost.Locale. :) And even if I write an abstract
character class, it will probably be separate from the core class and
rely on these other Unicode libraries to perform the actual work.

>> I added a static method unicode_string_adapter::make_codepoint_iterator
>> to do what you have requested. It accepts three code unit iterator
>> parameters, current, begin, and end, so that it can iterate in both
>> directions without going out of bounds. Hope that is what you are
>> looking for.
>>
>>
>
> Yes, this is better, but it should not be a static method
> of string_adaptor but a free algorithm.

Actually the code point iterator class is now independent and can be
used for generic purposes. However, it requires a few more template
parameters, so it may be even less convenient to use directly.
Currently its signature is as follows:

template <typename CodeunitIterator, typename Encoder, typename Policy>
class codepoint_iterator
{
  public:
    typedef CodeunitIterator codeunit_iterator_type;

    codepoint_iterator(
            codeunit_iterator_type codeunit_it,  // current position
            codeunit_iterator_type begin,        // lower bound
            codeunit_iterator_type end);         // upper bound
};

If you like this class I'll try to implement some convenient functions
to construct UTF-8/16/32 iterators.
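
Such a convenience function might look roughly like the sketch below.
To keep it self-contained it uses a simplified, UTF-8-only stand-in
(utf8_cursor and make_utf8_cursor are hypothetical names, not the real
codepoint_iterator) and assumes well-formed UTF-8:

```cpp
#include <cassert>
#include <string>

// Hypothetical UTF-8-only cursor holding current, begin and end, in the
// spirit of the three-iterator constructor above. Assumes well-formed
// UTF-8 and that the cursor starts on a character boundary.
template <typename It>
class utf8_cursor {
public:
    utf8_cursor(It cur, It begin, It end) : cur_(cur), begin_(begin), end_(end) {}

    // Decodes the code point at the current position.
    // Must not be called when the cursor is at end.
    char32_t operator*() const {
        It it = cur_;
        const unsigned char c = static_cast<unsigned char>(*it);
        const int len = c < 0x80 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
        char32_t cp = len == 1 ? c : c & (0xFF >> (len + 1));
        for (int k = 1; k < len; ++k) {
            if (++it == end_) break;  // truncated sequence: stop at the bound
            cp = (cp << 6) | (static_cast<unsigned char>(*it) & 0x3F);
        }
        return cp;
    }

    // Moves forward over one code point, never past end.
    utf8_cursor& operator++() {
        if (cur_ != end_) ++cur_;
        while (cur_ != end_ && is_trail(*cur_)) ++cur_;
        return *this;
    }

    // Moves backward over one code point, never before begin.
    utf8_cursor& operator--() {
        if (cur_ != begin_) --cur_;
        while (cur_ != begin_ && is_trail(*cur_)) --cur_;
        return *this;
    }

    It base() const { return cur_; }

private:
    // UTF-8 trail (continuation) bytes have the bit pattern 10xxxxxx.
    static bool is_trail(char c) {
        return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
    }
    It cur_, begin_, end_;
};

// Convenience constructor: the caller no longer spells out the iterator
// type or passes the two bounds by hand.
inline utf8_cursor<std::string::const_iterator>
make_utf8_cursor(const std::string& s) {
    return utf8_cursor<std::string::const_iterator>(s.begin(), s.begin(), s.end());
}
```

The two bounds are what let operator++ and operator-- stop safely at
either end of the underlying string, which is the reason for passing
begin and end in addition to the current position.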

Thanks.

cheers,

Soares

