Subject: Re: [boost] [regex] How robust are the <boost/regex/pending/unicode_iterator.hpp> adapters?
From: Dave Abrahams (dave_at_[hidden])
Date: 2011-07-19 10:39:10


on Tue Jul 19 2011, John Maddock <boost.regex-AT-virgin.net> wrote:

>>>> Boost.Filesystem needs the UTF-32 to UTF-16 and UTF-16 to UTF-32
>>>> adapters to implement char16_t and char32_t support. Do they have any
>>>> known bugs or other outstanding problems?
>>>
>>> Yes, they can read past the end of your input range if it contains
>>> invalid data at the end.
>>
>> Interesting. Would a fix be difficult?
>
> I was about to say there aren't any known issues, but yes that is a
> problem - and a fix would mean changing the interface - the problem
> comes because the iterators only store the current position in the
> underlying sequence and assume that they can increment or decrement
> over a complete multi-byte sequence. So if your underlying sequence
> contains a *truncated* multi-byte sequence at the start or end of the
> string then they can read past-the-end or even past-the-start :-(
>
> The only real fix is to redesign them to be range-based, so we can add
> the additional checks necessary, but of course this also makes them
> more heavyweight than they are at present. I guess I was hoping we
> would have had a proper Unicode library for this by now (in Boost that
> is, not the sandbox ;)

What about just asking people who aren't sure whether they're processing
invalid Unicode to add some sentinel bytes? Wouldn't that work?
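
To make the two options concrete, here is a rough sketch for the UTF-16
to UTF-32 direction. It is not the Boost.Regex code: every name in it
(decode_checked, decode_unchecked, decode_with_sentinel) is made up for
illustration, and substituting U+FFFD for a bad sequence is my
assumption, not necessarily what the real adapters do. decode_checked
is the range-based approach (it knows where the input ends);
decode_unchecked knows only the current position, so it is safe only
when the caller has padded the buffer with a sentinel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration only; not part of Boost.Regex.
const char32_t replacement_char = 0xFFFD;  // assumed error policy: emit U+FFFD

inline bool is_lead(char16_t u)  { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_trail(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

inline char32_t combine(char16_t lead, char16_t trail)
{
    return 0x10000 + ((char32_t(lead) - 0xD800) << 10)
                   + (char32_t(trail) - 0xDC00);
}

// Range-based decode: 'last' is known, so a lone lead surrogate at the
// very end is detected without reading past the end.
inline char32_t decode_checked(const char16_t*& p, const char16_t* last)
{
    assert(p != last);                      // precondition: input not empty
    char16_t lead = *p++;
    if (!is_lead(lead))
        return is_trail(lead) ? replacement_char : char32_t(lead);
    if (p == last || !is_trail(*p))         // truncated or malformed pair
        return replacement_char;
    return combine(lead, *p++);
}

// Position-only decode: after a lead surrogate it always reads the next
// code unit. Safe only if the buffer is padded with a non-surrogate.
inline char32_t decode_unchecked(const char16_t*& p)
{
    char16_t lead = *p++;
    if (!is_lead(lead))
        return is_trail(lead) ? replacement_char : char32_t(lead);
    if (!is_trail(*p))                      // may be looking at the sentinel
        return replacement_char;
    return combine(lead, *p++);
}

// Sentinel approach: append one code unit that is not a surrogate, then
// decode only the original length with the position-only decoder.
inline std::vector<char32_t> decode_with_sentinel(std::vector<char16_t> in)
{
    const std::size_t n = in.size();
    in.push_back(0);                        // sentinel: U+0000
    std::vector<char32_t> out;
    const char16_t* p = in.data();
    const char16_t* const end = p + n;      // logical end, before the sentinel
    while (p < end)
        out.push_back(decode_unchecked(p));
    return out;
}
```

One code unit of padding is enough here because a truncated UTF-16
sequence can only be missing its trail surrogate. UTF-8 input could be
missing up to three continuation bytes, so it would need that many
sentinel bytes, and padding at the end does nothing for the
past-the-start case when decrementing unless the front is padded too.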

-- 
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
