From: Eric Niebler (eric_at_[hidden])
Date: 2005-05-17 13:30:35


Following up, after sleeping on this for a bit....

John Maddock wrote:
>> I noticed some surprising behavior with match_partial and
>> regex_search. Consider:
>>
>> regex e("abc|b");
>> string str("ab");
>> smatch what;
>> if(regex_search(str, what, e, match_default | match_partial))
>> {
>> cout << (what[0].matched ? "full" : "partial") << '\n';
>> }
>>
>> This code displays "partial". Clearly, regex_search is bombing out
>> just as soon as it finds a match, partial or otherwise. But in this
>> case, if it kept looking, it would find a full match. My understanding
>> is that full matches are always preferred to partial matches. I
>> couldn't find any discussion about this case in the regex docs or the
>> std proposal. Did I miss it? What's the intention here? Is the std
>> proposal underspecified?
>
>
> It's so underspecified it's not there at all! (it got removed 'cos we
> couldn't figure out the right wording, even though everyone agreed it
> was a useful feature).

Oh, right. I was there. Duh.

>
> As far as current Boost.Regex is concerned: it prefers in order:
>
> 1) The leftmost match.
> 2) A full match.
> 3) The longest match (if it's a POSIX expression), otherwise a "depth
> first search" match (Perl expressions).
>
> It's the "leftmost" bit that's getting you here. To be honest I'm not
> sure what the right thing to do is here, I can imagine situations when
> either a full or a partial match would be the correct answer in this case.

After pondering this for a bit, I am now of the opinion that the current
behavior (bombing out of regex_search on partial matches rather than
searching for a full match) is correct. I figured I'd share my reasoning
and record it here for posterity. There are two use cases for match_partial:

<<Interactive user input validation>>

In this case, the only thing that matters is whether the input is
invalid. So it doesn't matter whether we return a full match or a
partial match because they mean the same thing: not invalid.

<<Data pull>>

When matching buffered data, match_partial is used to find matches that
span chunks. Ideally, it should be possible to use a buffering scheme
together with match_partial to find the same matches as if the data
hadn't been chunked. In this case, you want regex_search to quit early
and return a partial match so you can read more data and retry. Take the
example I gave above, matching the pattern "abc|b" against the string
"ab". In a data-pull scenario, it's possible that the next chunk of text
begins with a "c". In that case, the leftmost, longest match is "abc"
(where the text "abc" spans the two chunks of text), not "b". Quitting
early with a partial match gives users the chance to retry and find the
leftmost longest match.

-- 
Eric Niebler
Boost Consulting
www.boost-consulting.com
