Subject: [boost] [RFC] Unicode and Converters/Segmenters
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2010-08-01 18:17:55


It's been a while (about a year, actually) since I last had feedback
on my Unicode library, so here I am requesting comments.

The Unicode library provides facilities to convert between UTF and
locale encodings in as nice and generic a way as possible, as well as
a few Unicode character properties that can be used for normalization
or segmentation into graphemes.

The largest part of the library is actually a fairly intricate generic
Converter and Segmenter system which, among other things, makes it
possible to define a variable-width N-to-M conversion step in an easy
and stateless way.
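
To give an idea of what a stateless, variable-width N-to-M step can look
like, here is a minimal sketch. The names utf16_to_utf8_step, convert_all
and utf16_to_utf8 are made up for this illustration and are not the
library's actual interface; the only assumption is that a step reads some
input elements, writes some output elements, and returns the advanced
positions.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical step (not the library's API): consumes 1 or 2 UTF-16 code
// units and emits 1 to 4 UTF-8 bytes. It keeps no state between calls.
// Precondition: first != last.
struct utf16_to_utf8_step {
    template <typename In, typename Out>
    std::pair<In, Out> operator()(In first, In last, Out out) const {
        std::uint32_t cp = static_cast<std::uint32_t>(*first++);
        if (cp >= 0xD800 && cp <= 0xDBFF) {            // high surrogate
            if (first == last)
                throw std::invalid_argument("truncated surrogate pair");
            std::uint32_t low = static_cast<std::uint32_t>(*first++);
            if (low < 0xDC00 || low > 0xDFFF)
                throw std::invalid_argument("invalid low surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unexpected low surrogate");
        }
        if (cp < 0x80) {
            *out++ = static_cast<char>(cp);
        } else if (cp < 0x800) {
            *out++ = static_cast<char>(0xC0 | (cp >> 6));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            *out++ = static_cast<char>(0xE0 | (cp >> 12));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            *out++ = static_cast<char>(0xF0 | (cp >> 18));
            *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        }
        return std::make_pair(first, out);
    }
};

// Eager application: repeat the step until the input is exhausted.
template <typename Step, typename In, typename Out>
Out convert_all(Step step, In first, In last, Out out) {
    while (first != last) {
        std::pair<In, Out> r = step(first, last, out);
        first = r.first;
        out = r.second;
    }
    return out;
}

// Convenience wrapper used to exercise the sketch.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string result;
    convert_all(utf16_to_utf8_step(), in.begin(), in.end(),
                std::back_inserter(result));
    return result;
}
```

Because the step owns no state, the same object can be reused from any
position that starts a new element, which is what makes it usable both
eagerly and one step at a time.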

The conversion can then be applied normally to the whole input, or step
by step through an iterator or range adaptor, essentially performing a
lazy conversion.
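
As an illustration of the lazy variant, the following sketch wraps a step
in an input iterator that only runs one step when its output is actually
needed. Everything here (utf32_to_utf16_step, lazy_convert_iterator,
collect_utf16, the max_output member) is hypothetical naming for this
example, not the library's adaptors, and the step does no validation of
its input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical step: 1 UTF-32 code point in, 1 or 2 UTF-16 units out.
// No validation is done (assumes a valid scalar value).
struct utf32_to_utf16_step {
    static const std::size_t max_output = 2;  // largest output of one step

    template <typename In, typename Out>
    std::pair<In, Out> operator()(In first, In /*last*/, Out out) const {
        char32_t cp = *first++;
        if (cp < 0x10000) {
            *out++ = static_cast<char16_t>(cp);
        } else {
            cp -= 0x10000;
            *out++ = static_cast<char16_t>(0xD800 + (cp >> 10));
            *out++ = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
        return std::make_pair(first, out);
    }
};

// Minimal input-iterator adaptor: converts one step at a time, on demand,
// buffering only the output of the current step.
template <typename Step, typename In, typename T>
class lazy_convert_iterator {
public:
    lazy_convert_iterator(In pos, In last)
        : pos_(pos), last_(last), next_(pos), size_(0), index_(0) { fill(); }

    T operator*() const { return buf_[index_]; }

    lazy_convert_iterator& operator++() {
        if (++index_ == size_) {  // current step consumed, run the next one
            pos_ = next_;
            fill();
        }
        return *this;
    }

    bool operator==(const lazy_convert_iterator& o) const {
        return pos_ == o.pos_ && index_ == o.index_;
    }
    bool operator!=(const lazy_convert_iterator& o) const {
        return !(*this == o);
    }

private:
    void fill() {
        index_ = 0;
        size_ = 0;
        if (pos_ == last_) return;  // end position: nothing to convert
        std::pair<In, T*> r = Step()(pos_, last_, buf_);
        next_ = r.first;
        size_ = static_cast<std::size_t>(r.second - buf_);
    }

    In pos_, last_, next_;
    T buf_[Step::max_output];
    std::size_t size_, index_;
};

// Walks the lazy iterator and collects what it yields.
std::u16string collect_utf16(const std::u32string& in) {
    typedef lazy_convert_iterator<utf32_to_utf16_step,
                                  std::u32string::const_iterator,
                                  char16_t> iter;
    std::u16string out;
    for (iter it(in.begin(), in.end()), end(in.end(), in.end());
         it != end; ++it)
        out.push_back(*it);
    return out;
}
```

The adaptor needs to know an upper bound on the output of a single step
in order to size its buffer, which is one reason the step size is part of
the design rather than an implementation detail.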
Converters can be combined, and can be used to make codecvt facets,
which allows them to be transparently applied by standard file streams.
Converters can also be built from codecvt facets, which is how the
Unicode library provides conversion between locale encodings.
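
For readers unfamiliar with codecvt facets, here is a hand-written sketch
of the kind of facet such a conversion could be packaged as. It is not
generated by the library: crlf_codecvt and write_through_facet are
hypothetical names, and the conversion (expanding "\n" to "\r\n" on
output) was chosen only because it is a simple 1-to-M step. Only the
output direction is overridden.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical facet: each '\n' written becomes "\r\n" externally.
class crlf_codecvt : public std::codecvt<char, char, std::mbstate_t> {
protected:
    bool do_always_noconv() const noexcept override { return false; }
    int do_encoding() const noexcept override { return 0; }  // variable width
    int do_max_length() const noexcept override { return 2; }

    result do_out(std::mbstate_t&,
                  const char* from, const char* from_end,
                  const char*& from_next,
                  char* to, char* to_end, char*& to_next) const override {
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            std::size_t need = (*from_next == '\n') ? 2 : 1;
            if (static_cast<std::size_t>(to_end - to_next) < need)
                return partial;  // caller must provide more output space
            if (*from_next == '\n') *to_next++ = '\r';
            *to_next++ = *from_next++;
        }
        return ok;
    }
};

// Calls the facet's public out() directly, the same entry point a file
// stream buffer uses when flushing characters to the file.
std::string write_through_facet(const std::string& text) {
    if (text.empty()) return std::string();
    crlf_codecvt cvt;
    std::mbstate_t state = std::mbstate_t();
    std::string out(text.size() * 2, '\0');  // worst case: every char doubles
    const char* from_next = nullptr;
    char* to_next = nullptr;
    cvt.out(state, text.data(), text.data() + text.size(), from_next,
            &out[0], &out[0] + out.size(), to_next);
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

To have a file stream apply such a facet transparently, it would
typically be imbued before the file is opened, e.g.
stream.imbue(std::locale(stream.getloc(), new crlf_codecvt)).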

I think the whole system really deserves to be a library in its own right
and not just part of Unicode, but I'm unsure how to handle this in Boost.
I think it's quite cool, but I haven't really seen much interest in
it. I may write a short tutorial on how to write base64 codecs with it
and how to use them with iostreams, just to show it off a bit more
outside of a Unicode context.

Anyway, the docs are here:
<http://mathias.gaunard.com/unicode/doc/html/>

And the code is on the sandbox:
<https://svn.boost.org/svn/boost/sandbox/SOC/2009/unicode/>

As I have said before, I will be submitting the full thing for formal
review *soon*, i.e. mid-September.

The changes that will go in are mostly performance-related: I'm
experimenting with things right now and doing benchmarks, considering
unsafe codecs and SIMD ones (SIMD is not just an implementation detail,
due to the step-by-step evaluation; using SIMD means having a much
larger step -- and of course, it cannot be safe).
I also need to tackle the issue of compile time, which is quite long: I
need better header separation.

I also need to find a better solution, from a binary point of view, to
expose composition from the shared library, as the current one doesn't
give much flexibility in implementation.

