Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2011-01-20 07:49:08


On 20/01/2011 05:38, Patrick Horgan wrote:
> On 01/19/2011 08:51 AM, Mathias Gaunard wrote:
>> My Unicode library works with arbitrary ranges, and you can adapt a
>> range in an encoding into a range in another encoding.
>> This can be used to lazily perform encoding conversion as the range is
>> iterated; such conversions may even be pipelined.
> Sounds interesting. Of course, ranges could be used with strings of
> whatever sort. Is the intelligence about the encoding in the ranges?

I've chosen not to attach encoding information to ranges, as this could
make my Unicode library quite intrusive.

It's design by contract: your input ranges must satisfy certain
criteria, such as being in a given encoding, depending on the function
you call. If the criteria are not satisfied, you get either undefined
behaviour or an exception, depending on which version of the function
you call.
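
To make the contract concrete, here is a minimal, self-contained
sketch of the two-variant idea; the names below are hypothetical
illustrations, not my library's actual interface:

    #include <cstddef>
    #include <stdexcept>

    // Unchecked variant: the contract is that p points at the start of
    // a well-formed UTF-8 sequence. Violating it is undefined behaviour.
    std::size_t u8_sequence_length(const char* p)
    {
        unsigned char b0 = static_cast<unsigned char>(*p);
        return b0 < 0x80 ? 1 : b0 < 0xE0 ? 2 : b0 < 0xF0 ? 3 : 4;
    }

    // Checked variant: verifies the contract and throws on violation
    // instead of exhibiting undefined behaviour.
    std::size_t u8_sequence_length_checked(const char* p, const char* end)
    {
        unsigned char b0 = static_cast<unsigned char>(*p);
        std::size_t n = b0 < 0x80 ? 1 : b0 < 0xC2 ? 0
                      : b0 < 0xE0 ? 2 : b0 < 0xF0 ? 3 : b0 < 0xF5 ? 4 : 0;
        if (n == 0 || static_cast<std::size_t>(end - p) < n)
            throw std::runtime_error("ill-formed UTF-8 input");
        for (std::size_t i = 1; i < n; ++i)
            if ((static_cast<unsigned char>(p[i]) & 0xC0) != 0x80)
                throw std::runtime_error("ill-formed UTF-8 input");
        // (a complete check would also reject overlong forms and surrogates)
        return n;
    }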

> As
> you iterate a range, does it move byte by byte, character by character,

You can adapt a range of code units into a range of code points, or into
a range of ranges of code points (combining character sequences,
graphemes, words, sentences, etc.).
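
For illustration, here is what such an adapter boils down to when going
from UTF-8 code units to code points; this is a hand-rolled sketch of
lazy, on-the-fly decoding, not my library's actual iterator type (and
it assumes well-formed input, per the contract above):

    #include <cstdint>
    #include <iostream>
    #include <string>

    // Minimal lazy "code units -> code points" adapter over UTF-8.
    class u8_codepoint_iterator
    {
        const char* p_;

        unsigned char b(int i) const
        { return static_cast<unsigned char>(p_[i]); }

    public:
        explicit u8_codepoint_iterator(const char* p) : p_(p) {}

        // Decoding happens here, on demand, as the range is iterated.
        std::uint32_t operator*() const
        {
            if (b(0) < 0x80) return b(0);
            if (b(0) < 0xE0) return (b(0) & 0x1Fu) << 6 | (b(1) & 0x3Fu);
            if (b(0) < 0xF0) return (b(0) & 0x0Fu) << 12
                                  | (b(1) & 0x3Fu) << 6 | (b(2) & 0x3Fu);
            return (b(0) & 0x07u) << 18 | (b(1) & 0x3Fu) << 12
                 | (b(2) & 0x3Fu) << 6 | (b(3) & 0x3Fu);
        }

        // Advance by one whole code point, i.e. 1 to 4 code units.
        u8_codepoint_iterator& operator++()
        {
            p_ += b(0) < 0x80 ? 1 : b(0) < 0xE0 ? 2 : b(0) < 0xF0 ? 3 : 4;
            return *this;
        }

        bool operator!=(const u8_codepoint_iterator& o) const
        { return p_ != o.p_; }
    };

    int main()
    {
        std::string s = "a\xC3\xA9z"; // "aez" with e-acute: 4 units, 3 points
        u8_codepoint_iterator it(s.data()), end(s.data() + s.size());
        for (; it != end; ++it)
            std::cout << "U+" << std::hex << *it << '\n';
    }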

> does it deal with compositions?

It can.

My library doesn't really provide string algorithms; it's up to you to
make sure you call those algorithms using the correct adapters.

For example, to search for a substring in a string, both of which are
in UTF-8, while taking combining characters into account, there are
different strategies:
- Decode both to UTF-32, normalize them, segment them into combining
character sequences, and perform a substring search on that.
- Decode both to UTF-32, normalize them, re-encode them both in UTF-8,
perform a substring search at the byte level, and ignore matches that do
not lie on the utf8_combining_boundary (which checks whether we're at a
UTF-8 code point boundary, decodes to UTF-32, and checks whether we're
at a combining character boundary).

You may want to skip the normalization step if you know your data is
already normalized.
The second strategy is likely to be quite a bit faster than the first,
because you spend most of the time working on chars contiguous in
actual memory, which can be optimized quite aggressively.
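
As a minimal sketch of the second strategy's filtering step, assuming
both sides are already normalized (so normalization is skipped) and
using plain std::string::find for the byte-level search; the
is_combining_boundary stub stands in for the real test, which needs
Unicode character property data:

    #include <cstddef>
    #include <string>
    #include <vector>

    // True iff position i is a UTF-8 code point boundary, i.e. not in
    // the middle of a sequence (continuation bytes are 10xxxxxx).
    bool is_u8_cp_boundary(const std::string& s, std::size_t i)
    {
        return i == 0 || i == s.size()
            || (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
    }

    // Stub: the real test decodes the code point at i and consults
    // Unicode property data to see whether it is a combining mark.
    bool is_combining_boundary(const std::string&, std::size_t)
    {
        return true; // assume no combining marks around the match
    }

    // Byte-level search, ignoring matches that do not lie on a
    // boundary on both ends (both strings assumed normalized UTF-8).
    std::vector<std::size_t> find_all(const std::string& hay,
                                      const std::string& needle)
    {
        std::vector<std::size_t> hits;
        for (std::size_t i = 0;
             (i = hay.find(needle, i)) != std::string::npos; ++i)
        {
            std::size_t j = i + needle.size();
            if (is_u8_cp_boundary(hay, i) && is_combining_boundary(hay, i)
             && is_u8_cp_boundary(hay, j) && is_combining_boundary(hay, j))
                hits.push_back(i);
        }
        return hits;
    }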

Both approaches are doable directly in a couple of lines by combining
Boost.StringAlgo and my Unicode library in various ways; and all
conversions can happen lazily or not, as one wishes.
Boost.StringAlgo, however, isn't that good (it only provides naive
O(n*m) algorithms, doesn't support right-to-left search well, and
certainly is unable to vectorize the cases where the range is made of
built-in types contiguous in memory), so eventually it might have to be
replaced.
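
For what it's worth, the Boost.StringAlgo half of such a combination
really is a one-liner; below it runs directly at the byte level on
plain UTF-8 storage, with the boundary filtering from the previous
sketch meant to be layered on top:

    #include <boost/algorithm/string/find.hpp>
    #include <iostream>
    #include <iterator>
    #include <string>

    int main()
    {
        std::string haystack = "na\xC3\xAFve"; // "naive" with i-diaeresis
        std::string needle   = "\xC3\xAF";

        // find_first performs a naive O(n*m) scan over the byte range
        boost::iterator_range<std::string::iterator> match =
            boost::algorithm::find_first(haystack, needle);

        if (!match.empty())
            std::cout << "match at byte offset "
                      << std::distance(haystack.begin(), match.begin())
                      << '\n';
    }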

> Is it available to read?

Somewhat dated docs are at <http://mathias.gaunard.com/unicode/doc/html/>.
A presentation is planned for BoostCon 2011, and a submission for review
before that.

