Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
From: Soares Chen Ruo Fei (crf_at_[hidden])
Date: 2011-08-14 08:37:44


On Sun, Aug 14, 2011, Artyom Beilis wrote:
> Except that these are default narrow string encodings on Windows
> (and sometimes even on Linux) at China, Japan and Korea...
>
> No they are not rare encodings.

Well, we are talking about *cough* Windows *cough* here; it's not as if
developers widely use these encodings out of their *own* choice
(except Shift-JIS).

But I think of these encodings as deprecated encodings that
developers should not use in newer programs. Of course there are
older-generation programmers who don't care about portability and
insist on using their old favorite encodings, and we know how long it
takes to fully deprecate something. But Boost.Ustr's intended audience
is those who badly *want* everything in Unicode but are forced to deal
with a small portion of legacy code that still uses the old MBCS
encodings.

On the other hand, I am not really considering any use case that makes
Boost.Ustr easy for hard-core developers who *insist* on continuing to
use the MBCS encodings while expecting Boost.Ustr to let them use
new Unicode libraries on their MBCS strings. Sure, you can do that, but
it is currently outside my scope and intention to support it.

> The better solution would be to create an index of
> Shift-JIS -> code-point and use it, you can
> probably do it in lazy way on first attempt
> of backward iteration.
>
> This is what I do for Boost.Locale.

Sorry, I thought an index is the same as a translation table. Isn't
that how the decoding is supposed to work?
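
One possible reading of the suggestion is a positional index (the byte
offset where each character starts) rather than a byte-to-code-point
translation table. Shift-JIS trail bytes overlap with ASCII and with
lead bytes, so stepping backward from an arbitrary byte is ambiguous; a
forward scan done lazily on the first backward step removes the
ambiguity. The sketch below illustrates that reading only. The class
name sjis_boundary_index and the simplified lead-byte ranges are
assumptions, and this is not the Boost.Locale code.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Simplified lead-byte test (assumption): the exact ranges differ
// between Shift-JIS variants such as CP932.
inline bool is_sjis_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical helper: records where each character starts, built
// lazily the first time backward iteration is requested.
class sjis_boundary_index {
public:
    explicit sjis_boundary_index(const std::string& bytes)
        : bytes_(bytes), built_(false) {}

    // Returns the start offset of the character that ends at `offset`.
    // Precondition: `offset` is a character boundary and offset > 0.
    std::size_t previous(std::size_t offset) {
        build();
        std::vector<std::size_t>::const_iterator it =
            std::lower_bound(starts_.begin(), starts_.end(), offset);
        return *(it - 1);  // last recorded start strictly below `offset`
    }

    bool built() const { return built_; }

private:
    void build() {
        if (built_) return;
        for (std::size_t i = 0; i < bytes_.size();) {
            starts_.push_back(i);
            bool two_bytes =
                is_sjis_lead_byte(static_cast<unsigned char>(bytes_[i])) &&
                i + 1 < bytes_.size();
            i += two_bytes ? 2 : 1;
        }
        built_ = true;
    }

    std::string bytes_;                // private copy of the encoded text
    std::vector<std::size_t> starts_;  // sorted character start offsets
    bool built_;
};
```

Forward-only users never pay for the scan; the index costs one pass
and one offset per character once backward iteration is first used.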

> Ok I see this code:
>
> class dynamic_codepoint_iterator_object
>     : public std::iterator<std::bidirectional_iterator_tag, codepoint_type>
> [...]
>
> THIS IS VERY BAD DESIGN.
>
> ------------------
>
> I've been there.
>
> Think of
>
> while(pos!=end) {
>    code_point = *pos;
>    ++pos;
> }
>
> How many virtual calls for one code point required?
>
> 1. equals
> 2. dereference
> 3. increment
>
> This is horrible way to do things.

It depends on how you look at it actually, but I'm not surprised that
most C++ programmers would complain about such a design and anything
that involves virtual functions. (I remember someone also complained
that your Boost.Locale library contains even a minimal number of
virtual functions. :)

People can look at it the same way they look at how many machine
instructions are spent on a for-loop in Python or JavaScript. It is
really a matter of whether you prefer less code with slower
performance, or more code with better performance. For my part, I'd
just choose the right design for the right situation.

I cleanly separated the static and dynamic parts into two separate
classes, so that you have full freedom to choose whichever class you
see fit. The reason I designed dynamic_unicode_string this way is so
that you can work transparently with any type of string, be it
std::string, std::u16string, std::wstring, std::vector, or anything
else. With such a flexible design, the only way to hide the underlying
string type at run time is to use virtual functions.

If performance is critical for you, you can just use
unicode_string_adapter and ignore the dynamic_unicode_string class.
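
To make the trade-off concrete, here is a minimal sketch of the two
styles. It is not the Boost.Ustr code: codepoint_cursor, utf32_cursor
and the call counter are hypothetical names used only to count how
many virtual calls the quoted loop makes, next to a template version
that makes none.

```cpp
#include <cstddef>
#include <string>

typedef char32_t codepoint_type;

// Counts calls made through the virtual interface (illustration only).
static std::size_t g_virtual_calls = 0;

// Hypothetical type-erased cursor: one virtual call per primitive.
struct codepoint_cursor {
    virtual ~codepoint_cursor() {}
    virtual bool equals(const codepoint_cursor& other) const = 0;
    virtual codepoint_type dereference() const = 0;
    virtual void increment() = 0;
};

// Concrete cursor over UTF-32 data; every other encoding would need
// its own subclass behind the same interface.
class utf32_cursor : public codepoint_cursor {
public:
    explicit utf32_cursor(const char32_t* p) : p_(p) {}
    bool equals(const codepoint_cursor& other) const override {
        ++g_virtual_calls;
        // Sketch only: assumes both cursors have the same dynamic type.
        return p_ == static_cast<const utf32_cursor&>(other).p_;
    }
    codepoint_type dereference() const override {
        ++g_virtual_calls;
        return *p_;
    }
    void increment() override {
        ++g_virtual_calls;
        ++p_;
    }
private:
    const char32_t* p_;
};

// Dynamic version: the loop from the quoted message.
std::size_t count_codepoints_dynamic(codepoint_cursor& pos,
                                     const codepoint_cursor& end) {
    std::size_t n = 0;
    while (!pos.equals(end)) {
        codepoint_type cp = pos.dereference();
        (void)cp;  // a real caller would use the code point here
        pos.increment();
        ++n;
    }
    return n;
}

// Static version: the iterator type is known at compile time, so the
// compiler can inline everything and no virtual call is made.
template <typename Iterator>
std::size_t count_codepoints_static(Iterator pos, Iterator end) {
    std::size_t n = 0;
    for (; pos != end; ++pos) ++n;
    return n;
}
```

For N code points the dynamic loop makes 3N + 1 virtual calls (the
extra one is the final comparison against end), which is the cost
being objected to; the static loop trades that for one instantiation
per string type.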

> I've started to work on generic "abstract-iterator"
> for Boost.Locale however hadn't completed the work
> yet. It allows to reduce virtual call per-character
> below 1.

Can you show me the code of your abstract-iterator? Perhaps your
objective is different from mine so our designs are different, or
perhaps you do have a better design that I can learn from.
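
Without having seen that code, one plausible way to get below one
virtual call per character is to fetch code points in batches, so a
single virtual call fills a small buffer that is then consumed
without further dispatch. The sketch below is only a guess at that
idea; codepoint_source, utf32_source and buffered_reader are
hypothetical names and not part of Boost.Locale or Boost.Ustr.

```cpp
#include <cstddef>
#include <string>

typedef char32_t codepoint_type;

// Hypothetical source interface: one virtual call fills a buffer.
struct codepoint_source {
    virtual ~codepoint_source() {}
    // Writes up to `max` code points into `out` and returns how many
    // were written; returns 0 once the input is exhausted.
    virtual std::size_t read(codepoint_type* out, std::size_t max) = 0;
};

// Concrete source over UTF-32 data that also counts read() calls.
class utf32_source : public codepoint_source {
public:
    explicit utf32_source(const std::u32string& s)
        : s_(s), pos_(0), calls_(0) {}
    std::size_t read(codepoint_type* out, std::size_t max) override {
        ++calls_;
        std::size_t n = 0;
        while (n < max && pos_ < s_.size()) out[n++] = s_[pos_++];
        return n;
    }
    std::size_t calls() const { return calls_; }
private:
    std::u32string s_;   // private copy of the input
    std::size_t pos_;
    std::size_t calls_;
};

// Non-virtual front end: refills its buffer only when it runs dry.
class buffered_reader {
public:
    explicit buffered_reader(codepoint_source& src)
        : src_(src), size_(0), next_(0) {}

    // Stores the next code point in `cp`; returns false at end of input.
    bool next(codepoint_type& cp) {
        if (next_ == size_) {
            size_ = src_.read(buf_, buffer_size);
            next_ = 0;
            if (size_ == 0) return false;
        }
        cp = buf_[next_++];
        return true;
    }
private:
    static const std::size_t buffer_size = 64;
    codepoint_source& src_;
    codepoint_type buf_[buffer_size];
    std::size_t size_;
    std::size_t next_;
};
```

With a 64-entry buffer, reading 100 code points costs three virtual
calls: two that return data and one that reports the end.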

> This is not a way to go.
>
> (BTW you had forgot clone() member function)

Oh yah, ok I'll add that when I have the time. Thanks!

> The entire motivation behind this library was to provide
> some "Unicode Wrapping" over different encodings.
>
> And if you tell me that the library is "locale" agnostic makes
> it unsuitable for the proposed motivation because
> non-Unicode encodings do change in run-time.
>
> But forget locale - according to the motivation requires
> to support at least runtime OS ANSI codepage to Unicode
> which it does not support.
>
> This is the biggest flaw of the current library.

I am sorry, but can you give me an example of how that actually
happens? I'm not familiar with Windows development, so I don't know
most of its quirks.

By changing the locale at run time, do you mean that an existing char*
string stored on the stack/heap can suddenly change its byte content
to fit an encoding that was changed at run time? Or does the run-time
change only affect new char* strings obtained from the Windows API,
while old char* strings still retain their old byte content and
encoding? If it is the latter, I don't see why the proposed snippet in
my previous message doesn't solve your problem.

By locale-agnostic, what I actually mean is leaving the locale-related
problems to a higher layer. I think locale and encodings are two
separate issues and they should really be implemented in different
classes and libraries. I also don't like locale-aware functions that
do the work themselves instead of delegating the actual functionality
to a locale-agnostic function. I think all locale-aware functions
should be implemented in two steps: one function that detects the
locale, and then delegates to another, specific function that does the
work and ignores the current locale.
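
A minimal sketch of that two-step split, using hypothetical names
(decode, detect_current_encoding, decode_in_current_locale) that are
not part of Boost.Ustr. The detection function is a stub; a real one
would query the OS code page or the global locale at call time.

```cpp
#include <stdexcept>
#include <string>

// Tiny set of encodings, enough to show the structure.
enum class encoding { ascii, latin1 };

// Step 2: the locale-agnostic worker. The encoding is an explicit
// argument, so the result never depends on global state.
std::u32string decode(const std::string& bytes, encoding enc) {
    std::u32string out;
    for (unsigned char b : bytes) {
        if (enc == encoding::ascii && b > 0x7F)
            throw std::invalid_argument("non-ASCII byte in ASCII input");
        // ASCII and Latin-1 bytes map directly to U+0000..U+00FF.
        out.push_back(static_cast<char32_t>(b));
    }
    return out;
}

// Stub (assumption): stands in for a run-time query of the current
// narrow encoding.
encoding detect_current_encoding() {
    return encoding::latin1;
}

// Step 1: the locale-aware wrapper only detects and delegates.
std::u32string decode_in_current_locale(const std::string& bytes) {
    return decode(bytes, detect_current_encoding());
}
```

Because the wrapper asks for the encoding on every call, a run-time
change affects only strings decoded afterwards, and code that already
knows its encoding can call decode directly.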

> Or maybe just convert "uncommon-encoding" to UTF-8/UTF-16/UTF-32
> and forget all the wrapper?

Yes, if you can do that and assume every std::string is UTF-8, then of
course you don't need Boost.Ustr anymore. But until that happens,
Boost.Ustr is what I think we will need for the moment. :)

> Dear Soares Chen Ruo Fei,
>
> (Sorry I have no idea what is the first name :-) )

You can call me Soares, which is the unofficial English name I gave
myself because my given name is hard for English speakers to
pronounce. Chen is my family name. I have to keep my given name in the
email because that is my official name, and I thought it might make it
easier for GSoC to track my posts.

> Don't get me wrong. I see what you are trying to do and it is
> good thing.
>
> There is a one big problem:
>
>     You entered very-very-very dangerous swamp.
>
> It is very easy to do wrong assumptions not because you don't try
> to do the best but because even such a small problem is very
> diverse. And the "best" you can do is to assume that if there
> can be something wrong it will.

I do realize that I am *always* trying to do something very dangerous
in almost all my previous and current projects. I don't like the
harsh criticism I get for my radical ideas and ambitions, and I don't
like that I keep choosing projects that are "dangerous". But since
that's who I really am, I have no choice but to live with it and
strive for eventual success.

> It is not an accident that there is no "unicode string"
> in Boost and that there are too many "fancy" Unicode
> strings around like QString, icu::UnicodeString, gtk::ustring
> and many others that somehow fail to provide the goodies
> you need.

That doesn't stop me from trying, and from learning something valuable
in case it still fails. After all, this is what GSoC really should be
about, right: trying to solve something challenging and learning by
making mistakes. ;)

cheers,

Soares

