Subject: Re: [boost] RFC: interest in Unicode codecs?
From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2009-07-20 17:33:54


On Mon, Jul 20, 2009 at 18:42, Eric Niebler<eric_at_[hidden]> wrote:
> Mathias Gaunard wrote:
>>
>> Rogier van Dalen wrote:
>>>
>>> Non-checking iterator adaptors can be faster. That would be useful
>>> when you know that a string is safe, for example, in a UTF string type
>>> that has a validity invariant.
>>
>> I suppose that type of string should probably use optimized iterators that
>> make use of the fact it is stored on contiguous and properly aligned memory
>> anyway, so it will need special code.
>
> There are 2 orthogonal issues here:
> 1) whether a sequence is stored in contiguous memory
> 2) whether it is already guaranteed to be well-formed UTF-XX

I think where confusion could arise is this: even though these issues
are orthogonal, if it's just about optimising, it might be acceptable
to write code for a specific special case.

However, one sensible policy is "repairing" an invalid string:
interpreting overlong sequences, and replacing uninterpretable code
units with U+FFFD "Replacement character". This is similar to
Cory's "ReplaceCheckFailures". Such a policy is necessary if a program
needs to read a corrupted UTF file and make the most of it.

On the other hand, the current behaviour of throwing an error on
overlong or invalid sequences is also sensible. The one-to-one
relation between encoded and decoded form makes it the safest choice.
It can guarantee that the decoded form contains no NUL characters that
were not in the encoded form.

I think both of these policies (and possibly others that I haven't
thought of) will need to be supported. Checking policies are therefore
not just an optimisation.
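
To make the two policies concrete, here is a minimal sketch of a UTF-8
decoder parameterised on an error policy. All names (strict_policy,
repair_policy, decode_utf8) are hypothetical and not taken from any
proposed library; resynchronising one code unit at a time after an
error is just one possible choice.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical strict policy: any ill-formed input is an error.
struct strict_policy {
    static char32_t overlong(char32_t) {
        throw std::runtime_error("overlong UTF-8 sequence");
    }
    static char32_t invalid() {
        throw std::runtime_error("invalid UTF-8 code unit");
    }
};

// Hypothetical repairing policy: interpret overlong sequences and
// replace uninterpretable code units by U+FFFD.
struct repair_policy {
    static char32_t overlong(char32_t cp) { return cp; }
    static char32_t invalid() { return 0xFFFD; }
};

template <class Policy>
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;  // stays 0 if the lead byte is not valid
        char32_t cp = 0;      // code point being assembled
        char32_t min_cp = 0;  // smallest value that needs `len` bytes
        if (lead < 0x80)                { len = 1; cp = lead;        min_cp = 0x0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min_cp = 0x10000; }

        // Each trailing byte must look like 10xxxxxx.
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            ok = (c & 0xC0) == 0x80;
            cp = (cp << 6) | (c & 0x3F);
        }
        // Surrogates and values above U+10FFFF are never valid.
        if (ok && (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (!ok) {
            out.push_back(Policy::invalid());
            i += 1;  // skip one code unit and try to resynchronise
        } else {
            out.push_back(cp < min_cp ? Policy::overlong(cp) : cp);
            i += len;
        }
    }
    return out;
}
```

With this sketch, decode_utf8<repair_policy> turns the overlong
sequence C0 80 into U+0000, a NUL that was not in the encoded form,
whereas decode_utf8<strict_policy> throws on the same input.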

> Conflating the two will lead to bad design. I agree with Rogier.

That is great...

> The
> routines should make checking a policy. Iterators should be non-checked.
> Checked iterators can be adaptors.

... but I'm not sure I understand what you mean. I read this as "you
can build a checking iterator adaptor on top of a non-checking
iterator adaptor". I don't think this is true for decoding UTF. I
suspect, therefore, that I misunderstand something.
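
One way to illustrate the doubt is a minimal sketch of a non-checking
decoder. The function name is hypothetical and it handles only one-
and two-byte UTF-8 sequences. Once it has run, a well-formed and an
overlong encoding yield the same code point, so an adaptor that sees
only its output has nothing left to check.

```cpp
// Hypothetical non-checking decoder for a one- or two-byte UTF-8
// sequence. It assumes well-formed input and never tests validity.
inline char32_t decode_unchecked(const unsigned char* p) {
    if (p[0] < 0x80)
        return p[0];  // single-byte (ASCII) case
    // Two-byte case: 110xxxxx 10xxxxxx, taken on trust.
    return (static_cast<char32_t>(p[0] & 0x1F) << 6) |
           static_cast<char32_t>(p[1] & 0x3F);
}
```

Here the bytes 41 and the overlong pair C1 81 both decode to U+0041,
and 00 and C0 80 both decode to U+0000.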

Cheers,
Rogier

