Subject: Re: [boost] [regex] How robust are the <boost/regex/pending/unicode_iterator.hpp> adapters?
From: John Maddock (boost.regex_at_[hidden])
Date: 2011-07-19 04:24:56


>>> Boost.Filesystem needs the UTF-32 to UTF-16 and UTF-16 to UTF-32
>>> adapters to implement char16_t and char32_t support. Do they have any
>>> known bugs or other outstanding problems?
>>
>> Yes, they can read past the end of your input range if it contains
>> invalid data at the end.
>
> Interesting. Would a fix be difficult?

I was about to say there aren't any known issues, but yes, that is a
problem, and a fix would mean changing the interface. The problem arises
because the iterators only store the current position in the underlying
sequence and assume that they can increment or decrement over a complete
multi-unit sequence. So if your underlying sequence contains a *truncated*
multi-unit sequence at the start or end of the string, then they can read
past-the-end or even past-the-start :-(
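
To make the failure mode concrete, here is a rough sketch. It is not the
code in unicode_iterator.hpp; the names units_needed and would_overrun
are made up for illustration. It only computes how many UTF-16 code units
a position-only decoder would expect from the lead unit, and whether that
exceeds what the buffer actually holds. The past-the-start case when
decrementing is the mirror image.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: how many UTF-16 code units a decoder expects to
// consume, judged from the lead unit alone. A position-only iterator has
// nothing else to go on.
inline std::size_t units_needed(std::uint16_t lead) {
    return (lead >= 0xD800 && lead <= 0xDBFF) ? 2 : 1;  // high surrogate => pair
}

// Hypothetical check: would decoding at index pos touch a unit outside
// [0, s.size())? An iterator that does not know the end cannot ask this.
inline bool would_overrun(const std::vector<std::uint16_t>& s, std::size_t pos) {
    assert(pos < s.size());
    return pos + units_needed(s[pos]) > s.size();
}
```

With the input {0x0041, 0xD83D}, the second unit is a high surrogate with
no partner, so a decoder positioned there would expect one more unit than
the buffer contains.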

The only real fix is to redesign them to be range-based, so we can add the
additional checks necessary, but of course this also makes them more
heavyweight than they are at present. I guess I was hoping we would have
had a proper Unicode library for this by now (in Boost that is, not the
sandbox ;)
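
A minimal sketch of what a range-aware decode step might look like,
assuming the caller passes the end of the input along with the current
position. The function name decode_utf16 and the choice to throw on bad
input are assumptions for illustration only, not the eventual Boost
interface.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical range-aware decode step: reads one code point from
// [pos, last) and advances pos. Because it knows where the input ends,
// it can refuse a truncated surrogate pair instead of reading past it.
inline std::uint32_t decode_utf16(const std::uint16_t*& pos,
                                  const std::uint16_t* last) {
    assert(pos != last && "caller must not decode at end of input");
    std::uint32_t lead = *pos++;
    if (lead < 0xD800 || lead > 0xDFFF)
        return lead;                       // ordinary BMP code point
    if (lead > 0xDBFF)
        throw std::invalid_argument("unpaired low surrogate");
    if (pos == last)                       // the extra check that needs the end
        throw std::invalid_argument("truncated surrogate pair");
    std::uint32_t trail = *pos;
    if (trail < 0xDC00 || trail > 0xDFFF)
        throw std::invalid_argument("high surrogate not followed by low");
    ++pos;
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
}
```

The cost is visible here: every step carries and compares an extra
pointer, which is the "more heavyweight" trade-off mentioned above.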

Oh well, maybe I should just bite the bullet and change/fix this hole.

John.

