Subject: Re: [boost] [regex] How robust are the <boost/regex/pending/unicode_iterator.hpp> adapters?
From: Beman Dawes (bdawes_at_[hidden])
Date: 2011-07-19 07:39:51


On Tue, Jul 19, 2011 at 4:24 AM, John Maddock <boost.regex_at_[hidden]> wrote:
>>>> Boost.Filesystem needs the UTF-32 to UTF-16 and UTF-16 to UTF-32
>>>> adapters to implement char16_t and char32_t support. Do they have any
>>>> known bugs or other outstanding problems?
>>>
>>> Yes, they can read past the end of your input range if it contains
>>> invalid data at the end.
>>
>> Interesting. Would a fix be difficult?
>
> I was about to say there aren't any known issues, but yes that is a problem
> - and a fix would mean changing the interface - the problem comes because
> the iterators only store the current position in the underlying sequence and
> assume that they can increment or decrement over a complete multi-byte
> sequence.  So if your underlying sequence contains a *truncated* multi-byte
> sequence at the start or end of the string then they can read past-the-end
> or even past-the-start :-(

Ouch!

> The only real fix is to redesign them to be range-based, so we can add the
> additional checks necessary, but of course this also makes them more
> heavyweight than they are at present.  I guess I was hoping we would have
> had a proper Unicode library for this by now (in Boost that is, not the
> sandbox ;)
>
> Oh well, maybe I should just bite the bullet and change/fix this hole.
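
To make the range-based idea concrete, here is a rough sketch of what a
checked UTF-16 to UTF-32 conversion might look like once the decoder is
handed the end of the range. The names decode_utf16_checked and
utf16_to_utf32_checked are made up for illustration and are not part of
unicode_iterator.hpp; throwing on bad input is also just an assumed error
policy.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not a Boost API): decode one code point from
// [it, end) and advance `it`. Because the end of the range is known,
// a truncated surrogate pair is reported instead of read past.
template <class Iter>
std::uint32_t decode_utf16_checked(Iter& it, Iter end)
{
    if (it == end)
        throw std::out_of_range("no input left to decode");

    const std::uint32_t lead = static_cast<std::uint16_t>(*it);
    ++it;

    // Anything outside the surrogate block is a complete code point.
    if (lead < 0xD800u || lead > 0xDFFFu)
        return lead;

    // A low surrogate cannot start a pair.
    if (lead >= 0xDC00u)
        throw std::invalid_argument("unpaired low surrogate");

    // This is the check a position-only iterator cannot perform.
    if (it == end)
        throw std::invalid_argument("truncated surrogate pair");

    const std::uint32_t trail = static_cast<std::uint16_t>(*it);
    if (trail < 0xDC00u || trail > 0xDFFFu)
        throw std::invalid_argument("high surrogate not followed by low surrogate");
    ++it;

    return 0x10000u + ((lead - 0xD800u) << 10) + (trail - 0xDC00u);
}

// Hypothetical convenience wrapper: convert a whole UTF-16 range.
template <class Iter>
std::vector<std::uint32_t> utf16_to_utf32_checked(Iter first, Iter last)
{
    std::vector<std::uint32_t> out;
    while (first != last)
        out.push_back(decode_utf16_checked(first, last));
    return out;
}
```

The cost is the one mentioned above: every decode step has to carry the
end of the range along with the current position. A bidirectional version
would need the start of the range as well, to guard against the
past-the-start case.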

What about moving portions of Mathias Gaunard's Unicode library into
detail? Have you looked at his code in the sandbox?

I'll take a look at that too.

--Beman

