From: John Maddock (john_at_[hidden])
Date: 2005-07-01 05:41:28


> Also, I may have found another issue, closely related to the one under
> discussion. It regards case-insensitive matching of named character
> classes. The regex_traits<> provides two functions for working with
> named char classes: lookup_classname and isctype. To match a char class
> such as [[:alpha:]], you pass "alpha" to lookup_classname and get a
> bitmask. Later, you pass a char and the bitmask to isctype and get a
> bool yes/no answer.
>
> But how does case-insensitivity work in this scenario? Suppose we're
> doing a case-insensitive match on [[:lower:]]. It should behave as if it
> were [[:lower:][:upper:]], right? But there doesn't seem to be enough
> smarts in the regex_traits interface to do this.

I've always thought that a case-insensitive match for [[:lower:]] was an
abomination, frankly, but here's how I currently handle it:

If the final bitmask contains all of the bits of the mask returned by
lookup_classname("lower"), or all of the bits of the mask returned by
lookup_classname("upper"), then I OR the mask with the result of
lookup_classname("alpha").
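
A minimal sketch of that rule, not the actual Boost.Regex source. The
mask_type alias, the bit values, the one-argument lookup_classname
stand-in, and the helper names contains_all and widen_for_icase are all
hypothetical and chosen only for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical class bits; a real traits class chooses its own values.
using mask_type = std::uint32_t;
constexpr mask_type mask_lower = 1u << 0;
constexpr mask_type mask_upper = 1u << 1;
// Assumption: "alpha" covers lower, upper, and a bit for caseless letters.
constexpr mask_type mask_alpha = mask_lower | mask_upper | (1u << 2);
constexpr mask_type mask_digit = 1u << 3;

// Simplified stand-in for the traits lookup (hypothetical signature).
inline mask_type lookup_classname(const std::string& name)
{
    if (name == "lower") return mask_lower;
    if (name == "upper") return mask_upper;
    if (name == "alpha") return mask_alpha;
    if (name == "digit") return mask_digit;
    return 0; // unknown class name
}

// True when every bit of `part` is set in `whole`.
inline bool contains_all(mask_type whole, mask_type part)
{
    return part != 0 && (whole & part) == part;
}

// Widen the final mask of a bracket expression for a case-insensitive
// match: if it fully contains "lower" or "upper", OR in "alpha".
inline mask_type widen_for_icase(mask_type final_mask)
{
    const mask_type lower = lookup_classname("lower");
    const mask_type upper = lookup_classname("upper");
    if (contains_all(final_mask, lower) || contains_all(final_mask, upper))
        final_mask |= lookup_classname("alpha");
    return final_mask;
}

// Small usage check for the sketch.
inline void check_widening()
{
    assert(widen_for_icase(mask_lower) == mask_alpha);
    assert(widen_for_icase(mask_digit) == mask_digit); // no case bits: unchanged
}
```

Note that the widened mask is "alpha", which may cover more than
lower|upper, and that the rule only knows about the built-in "lower" and
"upper" names.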

> Imagine I write a traits class which recognizes [[:fubar:]], and the
> "fubar" char class happens to be case-sensitive. How is the regex engine
> to know that? And how should it do a case-insensitive match of a
> character against the [[:fubar:]] char class? John, can you confirm this
> is a legitimate problem?

OK, user defined classes may be an issue (see below).

> I see two options:
>
> 1) Add a bool icase parameter to lookup_classname. Then,
> lookup_classname( "upper", true ) will know to return lower|upper
> instead of just upper.
>
> 2) Add a isctype_nocase function
>
> I prefer (1) because the extra computation happens at the time the
> pattern is compiled rather than when it is executed.

If we're going to change this then (1) is definitely preferable; it's quite
a small change after all.
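
A rough sketch of what option (1) could look like, assuming a traits
class that takes the icase flag at pattern-compile time. The sketch_traits
name, the bit values, the ASCII-only isctype, and the "fubar" behaviour
are hypothetical and exist only to illustrate the idea:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

using mask_type = std::uint32_t;
constexpr mask_type mask_lower = 1u << 0;
constexpr mask_type mask_upper = 1u << 1;
constexpr mask_type mask_fubar = 1u << 2; // user-defined class (hypothetical)

struct sketch_traits
{
    // Option (1): the traits class is told about icase when the pattern
    // is compiled, so it can return the widened mask itself.
    static mask_type lookup_classname(const std::string& name, bool icase)
    {
        if (name == "lower")
            return icase ? (mask_lower | mask_upper) : mask_lower;
        if (name == "upper")
            return icase ? (mask_lower | mask_upper) : mask_upper;
        if (name == "fubar")
            return mask_fubar; // the traits author decides; here icase is ignored
        return 0; // unknown class name
    }

    // isctype is unchanged: no case handling happens per character.
    // ASCII-only tests keep the sketch independent of the locale.
    static bool isctype(char c, mask_type m)
    {
        if ((m & mask_lower) && c >= 'a' && c <= 'z') return true;
        if ((m & mask_upper) && c >= 'A' && c <= 'Z') return true;
        if ((m & mask_fubar) && std::string("fubar").find(c) != std::string::npos)
            return true;
        return false;
    }
};

// Small usage check for the sketch.
inline void check_icase_lookup()
{
    const mask_type m = sketch_traits::lookup_classname("lower", true);
    assert(sketch_traits::isctype('Q', m)); // upper case accepted under icase
    assert(sketch_traits::isctype('q', m));
}
```

With this shape the engine does not need to know which classes are case
sensitive; the extra work is done once per lookup_classname call rather
than once per character tested.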

In fact I suspect this may be a real bug in the current Boost.Regex Unicode
support: matching a case-insensitive [[:Ll:]] will only match lower case
letters. Frankly, though, which of the other L* categories it should match
is an open question: should it match Lo or Lm, for example?

Head swimmingly yours, John.

