Subject: Re: [boost] GSoC Proposal Preparation For Encoding Awared String
From: Soares Chen (crf_at_[hidden])
Date: 2011-03-24 00:36:13


Hi Artyom,

> I think boost::filesystem v3 is a big step forward, it allows you
> to use UTF-8 strings on Windows which I think is a really good
> beginning.
>
> Bottom line, if you want to improve Unicode awareness of Boost
> I think you need to adopt Boost.Filesystem v3 like policy
> all over the code base of Boost.
>
>
> 1. Use Wide API as native one in Boost everywhere under Windows
> 2. Use char * API as native one in Boost everywhere under non-Windows platforms
> 3. Use std::codecvt to handle this (after many tricks... )

Thanks for the suggestion to look at Boost.Filesystem. I have studied
how basic_path<> is constructed in Boost.Filesystem and used a similar
design in my proposal. Feel free to take a look at it.

> Even if you create a perfect UTF-8 string and then call
>
> fopen(your_perfect_string.c_str(),"r")
>
> Under windows... And it would not work <sigh... damn Windows>

I'll take a shortcut here and point you to the proposal I just
posted. With my proposed class, the call can be written something
like this:

typedef unicode_string_adapter<std::string, ....> utf8_string;
typedef unicode_string_adapter<std::basic_string<wchar_t>, ....> utf16_string;

utf8_string my_path("/path/to/file");

#ifdef _WIN32
// transparent conversion through constructor; the wide-character
// path needs _wfopen, since fopen only accepts const char*
_wfopen(utf16_string(my_path).str().c_str(), L"r");
#else
fopen(my_path.str().c_str(), "r");
#endif
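
To make "transparent conversion through constructor" a bit more
concrete, here is a minimal sketch of the idea. It is not the code from
my proposal: utf8_text, utf16_text and utf8_to_utf16 are hypothetical
names made up for this illustration, std::u16string stands in for the
wchar_t string so the example is portable, and the decoder assumes
well-formed UTF-8 and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode well-formed UTF-8 and re-encode as UTF-16.
// Malformed input is NOT detected; a real library must validate.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        assert(i + len <= in.size());  // sketch assumes complete sequences
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Hypothetical stand-in for a UTF-8 adapter.
class utf8_text {
public:
    explicit utf8_text(const std::string& s) : data_(s) {}
    const std::string& str() const { return data_; }
private:
    std::string data_;
};

// Hypothetical stand-in for a UTF-16 adapter. The non-explicit
// constructor is what performs the "transparent" transcoding.
class utf16_text {
public:
    utf16_text(const utf8_text& src) : data_(utf8_to_utf16(src.str())) {}
    const std::u16string& str() const { return data_; }
private:
    std::u16string data_;
};
```

With these definitions, utf16_text(utf8_text("abc")).str() yields the
UTF-16 string u"abc", which is the shape of the conversion used in the
Windows branch above.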

> Boost.Locale and several other my projects (CppCMS, CppDB) live happily
> with std::string.

I think it is not hard to maintain encoding consistency if the
developer only uses libraries maintained by the same group of people.
But as more libraries are mixed together, for example if CppCMS had a
module system and different libraries used Unicode processing backends
other than Boost.Locale, weird bugs would start to appear at random.
At least most Unicode bugs are not fatal; usually they are only an
annoyance when end users see weird text on the screen.

> The problem is that in vast majority of cases you don't need encoding aware
> string, as so many operations you usually do on strings are encoding
> agnostic. But this is other story.

Yes, in many cases there is no need to look at the content of the
string, and the main thing the code does is pass on whatever is inside
it. But since it doesn't matter what gets passed along, it would not
hurt to pass along content+encoding information, as the external
operation is essentially the same. The only real difference is the
type of the string. And functions that accept a certain type of object
but do not really care about its type or content are probably good
candidates to become generic templates anyway.
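
As a rough illustration of that last point, here is a sketch of an
encoding-agnostic component written as a template. message_queue is a
hypothetical name invented for this example, not something from
Boost or from my proposal. It never inspects the text, so it works
unchanged with std::string, std::u16string, or an encoding-aware
wrapper type.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>

// Hypothetical pass-through component: it stores and returns text
// without ever looking at its content or encoding.
template <typename StringT>
class message_queue {
public:
    void push(StringT text) { items_.push_back(std::move(text)); }

    bool empty() const { return items_.empty(); }

    // Returns the oldest message; the queue must not be empty.
    StringT pop() {
        assert(!items_.empty());
        StringT front = std::move(items_.front());
        items_.pop_front();
        return front;
    }

private:
    std::deque<StringT> items_;
};
```

The same class can be instantiated as message_queue<std::string> or
message_queue<std::u16string>; the only thing that changes is the type.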

Anyway, back to the topic: for legacy code that accepts std::string
instead of my proposed unicode_string_adapter, one can easily convert
to its internal string type by calling operator *() or a similarly
named function. The template class may even have an operator StringT()
to implicitly convert itself to its underlying string type when
needed.
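
Here is a minimal sketch of those two access paths. The class name
string_adapter_sketch and the functions legacy_length and
adapter_usage_example are hypothetical and exist only for this
example; the real unicode_string_adapter takes more template
parameters and may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, stripped-down adapter showing only the access paths.
template <typename StringT>
class string_adapter_sketch {
public:
    explicit string_adapter_sketch(const StringT& s) : str_(s) {}

    // Explicit access to the underlying string.
    const StringT& operator*() const { return str_; }

    // Implicit conversion (returns a copy) for legacy interfaces.
    operator StringT() const { return str_; }

private:
    StringT str_;
};

// Hypothetical legacy function that only knows about std::string.
inline std::size_t legacy_length(const std::string& s) { return s.size(); }

inline void adapter_usage_example() {
    string_adapter_sketch<std::string> text(std::string("hello"));
    assert(legacy_length(*text) == 5);  // explicit, via operator*()
    assert(legacy_length(text) == 5);   // implicit, via operator StringT()
}
```

The implicit conversion is convenient, but it also makes it easy to
drop the encoding information by accident, so whether to provide it is
a design choice rather than a given.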

> Why?
>
> 1. Because you will never get the consensus about what is the "right-thing"
> to do (wide, narrow, utf-8, utf-16) etc.
>
> Project that are handled and directed by a single source or management
> like Qt, GTK(mm), Java, C#, Python or others may decide what is the
> right thing.
>
> This will never happen in Boost as it is too pluralistic even in cases
> where it does not always make sense, just because the way libraries
> are developed, reviewed and got in - based on public reviews
> that eventually encourages diversity.
>
> 2. Because you would not likely to be able to enforce users to actually
> use your string. As boost is more about collaboration then enforcement
> of specific style.

You are right. Because there is no single "right" way to use strings,
my new proposal actually provides a generic solution on top of
existing strings. Hopefully the ability to choose any string type as
the underlying container will give the best of both worlds.

> 3. Even heavy discussions there hadn't got to any conclusion. So what would
> happen and final review of your library?

Heavy discussions indicate that the problem does exist and that there
is a need for a solution. However, I think the reason the previous
discussion never ended is that nobody other than Chad, and now Dean,
was willing to take action and write a library. Talk is cheap, so it
is easy to criticize a solution without solid evidence. I think the
best way for the discussion to move forward is to get more people to
take action and implement their own code, if they are really so
opposed to the existing solutions. Let the code fork, merge, and speak
for itself, and let the code with the best solution be the winner.
After all, this is how open source works, right? ;)

Thanks for your feedback!

