From: Erik Wien (wien_at_[hidden])
Date: 2005-07-24 13:14:29


Rogier van Dalen wrote:
> There was a student project aiming to produce a Unicode library, but I
> didn't hear anything of it after the thread in
> http://lists.boost.org/boost/2005/03/22580.php

Aw shucks... Time flies. I was hoping to get back to you sooner on this
(you've all been very helpful here, so you deserve more feedback than
you have been getting), but after completing the project I got a job
that takes up most of my time, and so the Unicode library has had to
play second fiddle for a while. Anyway, this is as good an occasion as
any, I guess, so I might as well fill you in on the latest
developments.

We finished our bachelor's project a couple of months ago, and we ended
up with a fairly usable (although highly unpolished) implementation.
The intention was to release that to you guys, but I wasn't completely
happy with some things in that version, so I decided to rewrite large
portions of it to make it "worthy" of your scrutiny. (I mean, how long
could that take? Sigh...) This obviously took much longer than I
expected (in fact, I'm still working on it), and that is more or less
why you haven't heard from me until now.

The version I have now provides mutable code point strings
(boost::ReversibleContainers), with both dynamic and locked encoding
forms (more or less the same as the ones described in earlier threads
here). It also supports normalization (all forms) of code point
sequences through STL-compatible algorithms, with a complete set of
tests to verify these algorithms. Finally, the library provides a
mutable "text-element" string class that can represent strings of
grapheme clusters, words, (sentences,) or anything else you want it to,
normalized to a specified normalization form, and in any encoding you
want. Tests for checking that grapheme clusters and words are broken up
correctly are also provided. (The text_element/text_element_string
should be pretty close to what you wanted as a "Unicode string class";
it certainly has the monster of a value_type thing covered. ;)) I can
give you some more details on the design and implementation of this a
little later; I don't have enough time to do that right now. It does
need to be revised though, as it's rather clumsy to use as it is. (This
is what I was hoping to do during the summer, but I didn't have the
time.)
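
For readers unfamiliar with the distinction between code points and
encoding forms, the following is a minimal sketch of the idea and not
code from the library described here. The names decode_utf8 and
encode_utf8 are hypothetical; they only show how a string stored in one
encoding form (UTF-8) maps to and from a sequence of code points. The
sketch assumes well-formed input and does not reject overlong forms or
surrogate code points, which a real library would have to do.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (illustration only): decode UTF-8 bytes into code points.
// Does NOT reject overlong encodings or surrogates.
std::vector<char32_t> decode_utf8(const std::string& bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        // The lead byte tells us how many bytes the sequence occupies.
        if (lead < 0x80)              { len = 1; cp = lead; }
        else if ((lead >> 5) == 0x06) { len = 2; cp = lead & 0x1F; }
        else if ((lead >> 4) == 0x0E) { len = 3; cp = lead & 0x0F; }
        else if ((lead >> 3) == 0x1E) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > bytes.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c >> 6) != 0x02)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical helper (illustration only): encode code points as UTF-8 bytes.
std::string encode_utf8(const std::vector<char32_t>& cps) {
    std::string out;
    for (char32_t cp : cps) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp <= 0x10FFFF) {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            throw std::invalid_argument("code point out of Unicode range");
        }
    }
    return out;
}
```

A code point string in the sense used above would hide this conversion
behind container-style iterators, so that reversing or mutating the
string works on whole code points rather than on individual bytes.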

As for creating the "Boost Unicode library", which really is the
ultimate goal of all this mucking about, I am beginning to feel like we
should make more of a community effort out of it. There are clearly a
lot of people experienced in Unicode here (with Graham Barnett now
joining the club), but as you said, no one seems to have much time to
spend on it. Therefore a more organized collaboration between all of us,
on both design and code, could be a good idea to get this moving along a
little faster. I'd be more than happy to donate the code I have
developed so far to serve as a starting-point/reference/example of
failure for something like that. Any thoughts?

- Erik

