Subject: Re: [boost] RFC: interest in Unicode codecs?
From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2009-07-20 09:18:16


On Sat, Jul 18, 2009 at 07:14, Mathias Gaunard <mathias.gaunard_at_[hidden]> wrote:
> Rogier van Dalen wrote:
>
>> Freestanding transcoding functions and codecvt facets are not the only
>> thing I believe a UTF library would need, though.
>
> I've purposely chosen not to use codecvt facets in my unicode
> library at all, but maybe I should provide them anyway for compatibility
> with the iostreams subsystem.
> I don't really find those practical to use.

I don't necessarily disagree, but I'm curious what alternative you have in mind.

>> Iterator
>> adaptors, I found, are a pain to attach error policies to and write
>> them correctly. For example, with a policy equivalent to your
>> "ReplaceCheckFailures", you need to produce the same code point
>> sequence whether you traverse an invalid encoded string forward or
>> backward. I've got code for UTF-8 that passes my unit tests, but the
>> error checking and the one-by-one decoding makes it much harder to
>> optimise.
>
> For now my iterator adaptors (and the codecs they're based on for that
> matter) perform full checks, including checking that we don't go past the
> end of the input range (one way or the other).
> While I wanted both versions with checks and without initially, only having
> one does make it easier to use.

Non-checking iterator adaptors can be faster. That would be useful
when you know that a string is safe, for example, in a UTF string type
that has a validity invariant.
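
To make the speed argument concrete, here is a minimal sketch of what
a non-checking decode step could look like. It is not code from either
library; the name decode_unchecked is made up for this example, and it
assumes the caller guarantees well-formed UTF-8 (for instance through
the invariant of a UTF string type).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from either library): decodes one code point
// from UTF-8 that the caller guarantees to be well formed, and advances
// `it` past it. There are no range checks and no validity checks.
template <typename Iterator>
char32_t decode_unchecked(Iterator& it) {
    unsigned char lead = static_cast<unsigned char>(*it++);
    // Debug-only sanity check; compiled out when NDEBUG is defined.
    assert((lead & 0xC0) != 0x80 && lead < 0xF8);
    if (lead < 0x80) return lead;  // single-byte (ASCII) sequence

    // The lead byte alone determines how many continuation bytes follow.
    int trailing = lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
    char32_t cp = lead & (0x3F >> trailing);  // payload bits of the lead
    while (trailing-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3F);
    return cp;
}
```

Because the input is trusted, the function needs only the current
position: it never compares against an end iterator and never inspects
the continuation bytes beyond masking them.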

> An error policy isn't really enough though, because to do full checks you
> need each iterator to know about the begin and the end of the range it's
> working on which could be avoided altogether when trusting the input.

I think this means that all iterator adaptors can be constructed from
3 iterators (begin, position, end) and the ones that don't check the
input can also be constructed from 1 iterator. For a checking forward
iterator, only two iterators are necessary (position, end). This is
how I implemented it, at any rate.
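
As a rough illustration of the two-iterator case, the sketch below
shows a checking forward decoder that stores only (position, end). It
is not my actual implementation: the class name, the decode_all helper
and the particular replacement convention (one U+FFFD per rejected
sequence, leaving an unexpected non-continuation byte unconsumed) are
assumptions made for the example.

```cpp
#include <cassert>
#include <string>

constexpr char32_t replacement_character = 0xFFFD;

// Hypothetical checking forward decoder: it needs (position, end) only.
// A bidirectional version would also need `begin` to stop when stepping back.
template <typename Iterator>
class checked_utf8_forward {
public:
    checked_utf8_forward(Iterator position, Iterator end)
        : pos_(position), end_(end) {}

    bool at_end() const { return pos_ == end_; }

    // Decodes one code point and advances. Malformed input yields U+FFFD.
    char32_t next() {
        assert(pos_ != end_);
        unsigned char lead = static_cast<unsigned char>(*pos_++);
        if (lead < 0x80) return lead;

        int trailing;
        char32_t min_cp;  // smallest code point allowed for this length
        if (lead >= 0xC2 && lead <= 0xDF)      { trailing = 1; min_cp = 0x80; }
        else if (lead >= 0xE0 && lead <= 0xEF) { trailing = 2; min_cp = 0x800; }
        else if (lead >= 0xF0 && lead <= 0xF4) { trailing = 3; min_cp = 0x10000; }
        else return replacement_character;  // stray continuation or bad lead

        char32_t cp = lead & (0x3F >> trailing);
        for (int i = 0; i < trailing; ++i) {
            if (pos_ == end_) return replacement_character;  // truncated
            unsigned char c = static_cast<unsigned char>(*pos_);
            // Leave a non-continuation byte for the next call to decode.
            if ((c & 0xC0) != 0x80) return replacement_character;
            ++pos_;
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates and values above U+10FFFF.
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return replacement_character;
        return cp;
    }

private:
    Iterator pos_;
    Iterator end_;
};

// Convenience wrapper for the example: decodes a whole byte string.
inline std::u32string decode_all(const std::string& bytes) {
    checked_utf8_forward<std::string::const_iterator> it(bytes.begin(),
                                                         bytes.end());
    std::u32string out;
    while (!it.at_end()) out.push_back(it.next());
    return out;
}
```

The end iterator is what lets next() detect a sequence truncated at the
end of the input; an unchecked decoder has no use for it.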

> They're fairly simple implementations and were never benchmarked
> (benchmarking my library isn't even scheduled at the moment), but they're
> quite correct (proper unit tests are in the works).

It makes sense to design for correctness. It's probably worth keeping
in mind, though, whether conceivable extensions and optimisations are
possible in your design.

I like the idea of the Pipe and related concepts. I am wondering,
however, whether the UTF-8 decoding iterator can be fast enough given
the current specification. I think Pipe (or another concept) might
have to support decoding of exactly one output element. Correct me if
I'm wrong.

The actual implementation of extensions and optimisations can be
delayed until the need appears. I'd be happy to contribute checking
policies.

Cheers,
Rogier

