Subject: Re: [boost] [gsoc]Boost.Ustr Unicode String Adapter First Preview
From: Soares Chen Ruo Fei (crf_at_[hidden])
Date: 2011-06-20 15:46:18


On Sat, Jun 18, 2011 at 7:00 PM, Anders Dalvander <boost_at_[hidden]> wrote:
> The library looks very interesting. I like your approach that the default
> iterator iterates over codepoints and not codeunits. I also like the way you
> handle both mutable and immutable strings.
>
> It would be very helpful with some kind of documentation or tutorial. Just a
> few lines of sample code would help understand it better.

Thank you, I am glad you like the design approach. I am currently
working on the documentation and figuring out how to use Doxygen for
Boost-style documentation. It will probably take a while before I can
publish the first draft, so I will explain the library briefly here.

The unicode_string_adapter class is intended to replace traditional
string containers such as std::string and std::vector by wrapping
them and "downgrading" them to raw code unit containers.
unicode_string_adapter then provides a uniform way to decode code
points from, and encode code points into, the code unit container
regardless of the actual underlying encoding.
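
As a rough illustration of the wrapping idea, here is a minimal sketch.
It is not the actual Boost.Ustr interface: the class name utf8_view and
its members are made up for this example, only UTF-8 is handled, and
well-formed input is assumed (there is no validation).

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical illustration only; this is NOT the Boost.Ustr interface.
// A std::string is treated as a raw container of UTF-8 code units and
// its contents are exposed as code points.
class utf8_view {
public:
    explicit utf8_view(std::string units) : units_(std::move(units)) {}

    // Access to the underlying raw code units.
    const std::string& code_units() const { return units_; }

    // Decode all code points. Assumes well-formed UTF-8.
    std::vector<char32_t> code_points() const {
        std::vector<char32_t> out;
        std::size_t i = 0;
        while (i < units_.size()) {
            const unsigned char lead = static_cast<unsigned char>(units_[i]);
            // The lead byte determines the length of the sequence.
            const std::size_t len =
                lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
            // Keep only the payload bits of the lead byte.
            char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
            // Each continuation byte contributes six more bits.
            for (std::size_t k = 1; k < len && i + k < units_.size(); ++k) {
                cp = (cp << 6) |
                     (static_cast<unsigned char>(units_[i + k]) & 0x3F);
            }
            out.push_back(cp);
            i += len;
        }
        return out;
    }

private:
    std::string units_;
};
```

The caller never needs to know that three bytes make up one character:
it asks for code points and the wrapper hides the code unit layout.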

I have just written a simple hello world example and uploaded it to
GitHub. You can view it at
https://github.com/crf00/boost.ustr/blob/master/libs/ustr/example/hello/hello.cpp.
Hopefully this example helps in understanding the library.

> How does this library work with Artyom Beilis Boost.Locale library, Mathias Gaunard Boost.Unicode library and ICU?

Since unicode_string_adapter produces code point iterators, it should
work with the functions in Mathias's Boost.Unicode library that accept
code point iterators, such as grapheme_segment. I have not tested this
though.

unicode_string_adapter is specifically designed for libraries that need
to provide APIs accepting strings in different encodings, such as
Artyom's Boost.Locale and also Boost.Filesystem. The idea is to replace
the legacy parameter types char*, wchar_t*, and std::string with a
single unicode_string_adapter template. Although the solution sounds
easy, the biggest challenge for existing libraries is that it will
break existing APIs unless the library author is willing to support
unicode_string_adapter and legacy strings at the same time.
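
To make the API idea concrete, here is a small sketch of one templated
entry point living next to a legacy overload. All names
(adapter_sketch, decode, count_code_points) are hypothetical and do not
come from Boost.Ustr, Boost.Locale or Boost.Filesystem; only UTF-16 and
UTF-32 containers are covered, and well-formed input is assumed.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical names throughout; this is not the Boost.Ustr API.
// One decoder per code unit container, assuming well-formed input.
inline std::vector<char32_t> decode(const std::u32string& s) {
    // UTF-32: every code unit already is a code point.
    return std::vector<char32_t>(s.begin(), s.end());
}

inline std::vector<char32_t> decode(const std::u16string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t u = s[i];
        // A high surrogate followed by another unit forms one code point.
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < s.size()) {
            const char32_t low = s[++i];
            u = 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00);
        }
        out.push_back(u);
    }
    return out;
}

// Stand-in for the adapter: a code unit container plus a uniform
// way to obtain code points from it.
template <typename Container>
struct adapter_sketch {
    Container units;
    std::vector<char32_t> code_points() const { return decode(units); }
};

// A single library entry point that serves every supported encoding.
template <typename Container>
std::size_t count_code_points(const adapter_sketch<Container>& text) {
    return text.code_points().size();
}

// Legacy overload kept alongside so that existing callers still compile.
inline std::size_t count_code_points(const std::u16string& legacy) {
    return count_code_points(adapter_sketch<std::u16string>{legacy});
}
```

Keeping the legacy overload as a thin forwarder is one way a library
author could support both styles during a transition, at the cost of
maintaining two signatures.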

I intend to make ICU's UnicodeString class one of the code unit
containers supported by unicode_string_adapter in the future. An ICU
Unicode string could then be written as
unicode_string_adapter<UnicodeString>, making it easily convertible to
other string types such as unicode_string_adapter<std::string> when
needed.

> If the interoperability is good, then you probably don't need to create an unicode_abstract_character class.

There are some features I have in mind that can be greatly simplified
by using a class like this. The good thing about
unicode_abstract_character is that we can construct independent
objects that each represent a single abstract character, which can be
used for higher-level purposes. Anyway, since this class is only
planned for the future, I think we can leave it for a later discussion.

>> At the moment my library only works under GCC with C++0x enabled, as I
>> was focusing on the design issues first. I also understand that I have
>> not adopted the Boost way of building the project. While I am now
>> going to spend more time on fixing these issues, I hope that this
>> discussion can have more focus on the design issues instead.
>
> It would be great if it would work under VC++2010 as well. Would be a lot
> easier (for me at least) to test and play with.

I apologize for only starting to fix portability issues this late.
Currently there are a few compilation errors that I am not familiar
with, so it might take a while for me to fix them.

>> Feel free to let me know any potential issues on the class design so
>> that I can fix it before it is too late. Thank you!
>
> I see a potential issue with the `unicode_string_adapter(const
> raw_char_type* other)` constructor, as it won't know the encoding of the
> string literal. See discussion between Ryou Ezoe and Artyom Beilis during
> the Boost.Locale review.

I did read the discussions in the Boost.Locale review, and I agree
that this is a very challenging problem. I tentatively added the
`unicode_string_adapter(const raw_char_type* other)` constructor at a
later stage to make construction from raw strings work in the unit
tests, but I agree with you that it most probably should be removed.

The main challenge I found is that there is actually no portable way
to create static Unicode strings embedded in any C++ source code. From
my understanding the encoding of static strings within source code is
dependent on the locale the compiler is using and the operating
system, so it is not possible to statically choose a single
encoder/decoder that is used for processing the source code strings.

I have a solution in mind that would allow developers to statically
construct Unicode strings in a portable way: wrap every source code
string in a USTR() macro before it is passed to a
unicode_string_adapter constructor. The construction of source code
strings would then look something like

unicode_string_adapter<std::string> my_string( USTR(L"世界你好") );

The USTR() macro will expand into another unicode_string_adapter
constructor whose template argument matches the encoding the compiler
is currently using. For example, it would expand as follows on
different platforms:

// UTF-16 source encoding
unicode_string_adapter<std::string> my_string(
    unicode_string_adapter< std::basic_string<char16_t> >(L"世界你好") );

// UTF-32 source encoding
unicode_string_adapter<std::string> my_string(
    unicode_string_adapter< std::basic_string<char32_t> >(L"世界你好") );

// GB2312 Chinese source encoding
// Not sure if this could happen on Windows, but even if it does,
// USTR() can still handle that
unicode_string_adapter<std::string> my_string( unicode_string_adapter<
    std::basic_string<char16_t>,
    string_traits< std::basic_string<char16_t> >,
    gb2312_encoding_traits< string_traits< std::basic_string<char16_t> > >
>(L"世界你好") );

Until the USTR() macro is built, I don't think it is even possible to
use unicode_string_adapter to handle Unicode strings from source code.
Unfortunately there seems to be no better solution.
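
One way the encoding selection behind such a macro could be
approximated is sketched below. This is only an illustration under
stated assumptions: the names USTR_SKETCH, literal_adapter,
ustr_literal_container and copy_units are invented here, and the sketch
assumes that wide literals are UTF-32 when wchar_t is wider than 16
bits and UTF-16 otherwise. That need not hold for every compiler, and a
case like GB2312 would need a real transcoder instead of a plain copy.

```cpp
#include <cwchar>   // WCHAR_MAX
#include <string>

// Hypothetical stand-in for the adapter; not the Boost.Ustr class.
template <typename Container>
struct literal_adapter {
    Container units;
};

// Assumption: the width of wchar_t indicates the wide literal encoding.
#if WCHAR_MAX > 0xFFFF
typedef std::u32string ustr_literal_container;   // treat as UTF-32
#else
typedef std::u16string ustr_literal_container;   // treat as UTF-16
#endif

// Copy the wide literal unit by unit into the selected container.
inline ustr_literal_container copy_units(const wchar_t* s) {
    ustr_literal_container out;
    for (; *s != L'\0'; ++s) {
        out.push_back(static_cast<ustr_literal_container::value_type>(*s));
    }
    return out;
}

// Wrap a wide literal in an adapter whose container type was chosen
// at preprocessing time.
#define USTR_SKETCH(literal) \
    (literal_adapter<ustr_literal_container>{ copy_units(literal) })
```

The point of the sketch is that the choice is made once, at compile
time, so calling code can write USTR_SKETCH(L"...") without naming an
encoding.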

Best Regards,

Chen Ruo Fei

