Subject: Re: [boost] [regex] How robust are the <boost/regex/pending/unicode_iterator.hpp> adapters?
From: Soares Chen Ruo Fei (crf_at_[hidden])
Date: 2011-07-20 05:14:13


On Tue, Jul 19, 2011 at 4:24 PM, John Maddock <boost.regex_at_[hidden]> wrote:
>>> Yes, they can read past the end of your input range if it contains
>>> invalid
>>> data at the end.
>>
>> Interesting. Would a fix be difficult?
>
> I was about to say there aren't any known issues, but yes that is a problem
> - and a fix would mean changing the interface - the problem comes because
> the iterators only store the current position in the underlying sequence and
> assume that they can increment or decrement over a complete multi-byte
> sequence.  So if your underlying sequence contains a *truncated* multi-byte
> sequence at the start or end of the string then they can read past-the-end
> or even past-the-start :-(
>
> The only real fix is to redesign them to be range-based, so we can add the
> additional checks necessary, but of course this also makes them more
> heavyweight than they are at present.  I guess I was hoping we would have
> had a proper Unicode library for this by now (in Boost that is, not the
> sandbox ;)
>
> Oh well, maybe I should just bite the bullet and change/fix this hole.
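
To make the range-based idea concrete, here is a minimal sketch of a
bounds-checked forward decode step. It is not the Boost.Regex code:
decode_utf8_checked is a hypothetical name, and the function only shows the
extra end-of-range check that an iterator holding just the current position
cannot perform. It does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption for illustration, not a Boost API).
// Decodes one UTF-8 code point from [it, last).
// On success: stores the code point in `out`, advances `it`, returns true.
// On failure (empty range, invalid lead byte, bad or missing continuation
// byte): leaves `it` untouched and returns false.
// Simplification: overlong forms and surrogate values are not rejected.
template <class Iter>
bool decode_utf8_checked(Iter& it, Iter last, char32_t& out)
{
    if (it == last) return false;

    Iter cur = it;
    const unsigned char lead = static_cast<unsigned char>(*cur);
    ++cur;

    std::size_t trailing = 0;  // number of continuation bytes expected
    char32_t cp = 0;
    if (lead < 0x80)                { trailing = 0; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { trailing = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { trailing = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { trailing = 3; cp = lead & 0x07; }
    else return false;  // stray continuation byte or invalid lead byte

    for (std::size_t i = 0; i < trailing; ++i) {
        // The check a position-only iterator cannot make: stop at the end
        // of the range instead of reading past it.
        if (cur == last) return false;
        const unsigned char c = static_cast<unsigned char>(*cur);
        if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
        ++cur;
    }

    it = cur;
    out = cp;
    return true;
}
```

With a truncated sequence at the end of the input the call simply fails and
leaves the position unchanged. The cost is the one John mentions: the caller
(or the iterator) has to carry the end of the range around as well.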

In my GSoC project I am currently developing a Unicode string adapter
library that wraps conventional string types such as std::string and
adds Unicode awareness to them. I am not sure whether it helps here, but
if you are developing new library APIs it might be useful. I have not
finished the documentation yet, but you can look at the draft at
http://crf.scriptmatrix.net/ustr/. The code repository is available on
GitHub: https://github.com/crf00/boost.ustr.

(Sorry, I don't mean to hijack the thread, but I hope that helps.)

cheers,

Soares Chen

