Subject: Re: [boost] GSoC Unicode library: second preview
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2009-06-21 20:58:07

Artyom wrote:

> Where is the source code?

On the sandbox svn, as I said. I can provide a tarball or zip if it is
really needed, but it is easier to keep the svn up to date.

>> UTF-16 ... This is the recommended encoding for dealing with
>> Unicode internally for general purposes
> To be honest, it is the most error-prone encoding to work with for Unicode:

You're not supposed to deal with it for text management; it is nothing
more than the encoding of your raw data.

UTF-16 is recommended because it allows algorithms to operate
efficiently while minimizing memory waste, and thus is believed to be a
better compromise than UTF-8 and UTF-32.
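
To make the memory trade-off concrete, here is a minimal sketch (not part
of the library; the function names are made up for illustration) that
computes how many bytes one code point occupies in each encoding form. It
assumes the argument is a valid Unicode scalar value.

```cpp
#include <cstddef>

// Bytes needed to store one code point in each encoding form.
// Assumption: cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
// These helpers are hypothetical and only illustrate the size trade-off.
std::size_t utf8_bytes(char32_t cp)
{
    if (cp < 0x80)    return 1;  // ASCII
    if (cp < 0x800)   return 2;  // e.g. Latin letters with diacritics
    if (cp < 0x10000) return 3;  // rest of the BMP
    return 4;                    // supplementary planes
}

std::size_t utf16_bytes(char32_t cp)
{
    // One 16-bit code unit inside the BMP, a surrogate pair outside it.
    return cp < 0x10000 ? 2 : 4;
}

std::size_t utf32_bytes(char32_t)
{
    return 4;                    // always one 32-bit code unit
}
```

For 'a' this gives 1/2/4 bytes (UTF-8/UTF-16/UTF-32), for U+4E2D it gives
3/2/4, and for U+1D11E it gives 4/4/4, so which encoding wastes the least
memory depends on the text being stored.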

Of course, the library works with UTF-8 and UTF-32 just as well. It
makes no difference to the generic algorithms (which don't exist yet,
but expect substring searching and the like). It's up to you to choose
what makes the most sense for your situation; for example, you may
choose UTF-8 because you need to interact a lot with programming
interfaces expecting that format.

> 1. It is variable length encoding
> 2. Surrogate characters are quite rare, and thus it is very
> hard to find bugs related to them.

All facilities of the library take that into account, and even more if
you ask them to: they will also be able to work at the grapheme cluster
level, which is simply part of the generic interface where it is relevant.

> It was mostly born as a "mistake" at the beginning of Unicode,
> when it was believed that 16 bits were enough for a single code point.
> So many software platforms adopted a 16-bit encoding that supported
> only the BMP. As a result, you can **easily** find a **huge** number of
> bugs in code that uses UTF-16. In most cases such bugs
> are hard to track down because these code points are rare.

They should be fairly easy to find.
Either you're using the algorithm that does the task correctly, or
you're fiddling with the encoding by hand which is likely to be wrong.
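
For reference, this is roughly what "the algorithm that does the task
correctly" has to do for UTF-16. It is a standalone sketch, not the
library's implementation: decode_utf16 is a hypothetical name, and
replacing unpaired surrogates with U+FFFD is a policy chosen just for
this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode UTF-16 code units into code points, combining surrogate pairs.
// Assumption for this sketch: unpaired surrogates become U+FFFD.
std::vector<char32_t> decode_utf16(const std::u16string& in)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                const char32_t hi = u - 0xD800;
                const char32_t lo = in[i + 1] - 0xDC00;
                out.push_back(0x10000 + (hi << 10) + lo);
                ++i;  // the low surrogate has been consumed too
            } else {
                out.push_back(0xFFFD);
            }
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            // Low surrogate without a preceding high surrogate.
            out.push_back(0xFFFD);
        } else {
            out.push_back(u);  // BMP code point, one code unit
        }
    }
    return out;
}
```

Code that indexes a UTF-16 string by code unit and treats each unit as a
character skips exactly the two surrogate branches above, which is the
class of bug being discussed.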

>> UTF-32 ... This encoding isn't really recommended
> As I mentioned above, this is not quite true; it is a much safer encoding
> to work with,

In my personal opinion, it only exists in order to be "politically
correct", so that broken code that relies on the illusion that you can
have fixed-size characters keeps working.

> - For boundary checks I'd suggest to use ICU or Qt4 like API: iterate
> over string and return each time next bound.

That's what consumer_iterator, and the _bounded functions that invoke
it, do.

With UTF-8 as the source code character encoding, and assuming C++0x
features for readability:

char foo[] = "eoaéôn";
for(auto subrange : u8_bounded(foo))
{
    // print all the code units of one code point
    for(unsigned char c : subrange)
        cout << c;

    // then a separator after each code point
    cout << ' ';
}
cout << endl;

e o a é ô n
(i.e. spaces are only put between code points, not code units)

> Not check if there is
> a bound on specific character.

Checking if a given position constitutes a boundary is a fairly useful
primitive to have for certain algorithms (since you can delay the
boundary check until the moment it is really needed instead of doing it
everywhere), and can be useful for applications that need to have some
kind of pseudo random-access.
It's also a primitive that the Unicode standard provides optimized
implementations of (part of the Unicode Character Database contains
information to speed up that primitive on grapheme clusters, for example).
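
As an illustration of such a primitive at the simplest level, here is a
sketch for code point boundaries in UTF-8. Both function names are
hypothetical and are not the library's interface; the input is assumed
to be well-formed UTF-8. Grapheme cluster boundaries would additionally
need property lookups from the Unicode Character Database, which this
sketch does not attempt.

```cpp
#include <cstddef>
#include <string>

// True if pos is at the start of a code point (or at the end of the string).
// Assumption: utf8 is well-formed; continuation bytes look like 10xxxxxx.
bool is_code_point_boundary(const std::string& utf8, std::size_t pos)
{
    if (pos >= utf8.size())
        return pos == utf8.size();
    return (static_cast<unsigned char>(utf8[pos]) & 0xC0) != 0x80;
}

// Pseudo random access: move an arbitrary byte offset back to the
// nearest boundary at or before it.
std::size_t align_to_boundary(const std::string& utf8, std::size_t pos)
{
    if (pos > utf8.size())
        pos = utf8.size();
    while (pos > 0 && !is_code_point_boundary(utf8, pos))
        --pos;
    return pos;
}
```

The check is only performed at the offsets the caller cares about, which
is the point of delaying the boundary test until it is really needed.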

consumer_iterator can either be implemented in terms of such a primitive
or in terms of the Consumer primitive.

It can therefore be used to iterate sequences of code units, grapheme
clusters, words, sentences, lines, etc. (any pattern modeled by the
Consumer concept).
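
To give an idea of how one generic loop can serve several patterns, here
is a rough sketch. Everything in it is an assumption made for
illustration: the consumer signature (take a range, return the end of the
first element), the two toy consumers, and the segment function are not
the library's actual Consumer concept or interface. Real word and line
breaking follow the Unicode segmentation rules and are far more involved
than these ASCII-only stand-ins.

```cpp
#include <string>
#include <vector>

// Hypothetical consumer: one "word" is a run of non-space characters
// or a run of spaces. Precondition: first != last.
struct word_consumer {
    template <class It>
    It operator()(It first, It last) const
    {
        const bool space = (*first == ' ');
        while (first != last && (*first == ' ') == space)
            ++first;
        return first;
    }
};

// Hypothetical consumer: one "line" runs up to and including '\n'.
struct line_consumer {
    template <class It>
    It operator()(It first, It last) const
    {
        while (first != last) {
            if (*first++ == '\n')
                break;
        }
        return first;
    }
};

// Generic driver: repeatedly asks the consumer where the next element
// ends and collects the subranges.
template <class Consumer>
std::vector<std::string> segment(const std::string& s, Consumer consume)
{
    std::vector<std::string> out;
    std::string::const_iterator it = s.begin();
    while (it != s.end()) {
        std::string::const_iterator next = consume(it, s.end());
        out.push_back(std::string(it, next));
        it = next;
    }
    return out;
}
```

With these definitions, segment("ab  c", word_consumer()) yields "ab",
"  " and "c", while segment("a\nbc", line_consumer()) yields "a\n" and
"bc"; only the consumer changes between the two calls.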

> - Examples and More description is required

There is a fairly simplistic example in the source,
libs/unicode/example/test.cpp, which I mostly use to check that things
still work between refactorings without setting up unit tests.

I'll try to work on a tutorial.
