Subject: Re: [Boost-bugs] [Boost C++ Libraries] #8304: Regex not matching case in character ranges if collate flag specified
From: Boost C++ Libraries (noreply_at_[hidden])
Date: 2013-03-21 18:46:04
#8304: Regex not matching case in character ranges if collate flag specified
----------------------------------+-----------------------------------------
Reporter: dave@… | Owner: johnmaddock
Type: Bugs | Status: closed
Milestone: To Be Determined | Component: regex
Version: Boost 1.51.0 | Severity: Problem
Resolution: invalid | Keywords:
----------------------------------+-----------------------------------------
Comment (by johnmaddock):
> Isn't case insensitive collation different from case sensitive
> collation?
Yes, since this is wide characters on Win32, it's basically Unicode
collation: http://www.unicode.org/reports/tr10/#Multi_Level_Comparison
which is based on "levels".
So character shape is compared first, then accent, then case, then some
other differences. That means that 'a', 'à' and 'A' collate next to each
other, followed by 'b' and 'B', if case sensitivity is on, while if matching
is case insensitive then 'a' and 'A' are treated as equivalent (i.e. they
collate the same), but 'à' is still separate.
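To make the "levels" idea concrete, here is a minimal sketch in standard C++. The weight table in `key_for` is hypothetical and covers only the five characters mentioned above; real Unicode collation takes its weights from a much larger table, and the choice of lowercase before uppercase at the third level is an assumption of this sketch, not something Boost.Regex or Win32 is confirmed to do.

```cpp
#include <algorithm>
#include <string>
#include <tuple>

// Hypothetical three-level sort key, hand-written for a few characters only.
struct CollationKey {
    int primary;    // base letter shape: 'a' before 'b'
    int secondary;  // accent: unaccented before grave
    int tertiary;   // case: lowercase before uppercase (assumed order)
};

CollationKey key_for(wchar_t c) {
    switch (c) {
        case L'a':      return {1, 0, 0};
        case L'A':      return {1, 0, 1};
        case L'\u00e0': return {1, 1, 0};  // a with grave accent
        case L'b':      return {2, 0, 0};
        case L'B':      return {2, 0, 1};
        default:        return {1000 + static_cast<int>(c), 0, 0};
    }
}

// Compare level by level; ignoring case simply drops the third level.
bool collates_before(wchar_t x, wchar_t y, bool ignore_case) {
    CollationKey a = key_for(x);
    CollationKey b = key_for(y);
    if (ignore_case) {
        a.tertiary = 0;
        b.tertiary = 0;
    }
    return std::tie(a.primary, a.secondary, a.tertiary)
         < std::tie(b.primary, b.secondary, b.tertiary);
}

// Two characters "collate the same" when neither sorts before the other.
bool collates_equal(wchar_t x, wchar_t y, bool ignore_case) {
    return !collates_before(x, y, ignore_case)
        && !collates_before(y, x, ignore_case);
}

std::wstring sorted_by_collation(std::wstring s, bool ignore_case) {
    std::stable_sort(s.begin(), s.end(), [ignore_case](wchar_t x, wchar_t y) {
        return collates_before(x, y, ignore_case);
    });
    return s;
}
```

With this table, all three 'a' variants sort ahead of 'b' and 'B'. Dropping the case level makes 'a' and 'A' compare equal, while 'à' stays distinct because it differs at the accent level.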
> Your comment seems to suggest that collation is an odd or unusual thing.
> We use the regex engine to allow users to search through text and they
> expect to be able to use '[a-z]+' and match against 'Années'. I guess we
> could add a 'Foreign language support' option to support both cases, I
> just wanted to make sure that the Boost implementation was 'correct'.
Sigh... I see where you're coming from (and why your users would want
that), but regex would have to implement its own collation algorithm to
support that. You could probably do that yourself, actually, by using a
custom traits class?
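As a rough illustration of the folding such a traits class would need, the sketch below does something simpler: it strips accents from the text before matching, using std::wregex from the standard library rather than a Boost.Regex traits class. The helpers `fold_accent`, `fold_accents` and `matches_letters` are hypothetical names, and the mapping covers only a few Latin-1 letters; it is not a full collation algorithm.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: map a few accented letters to their base letter.
// Anything not listed is returned unchanged.
wchar_t fold_accent(wchar_t c) {
    switch (c) {
        case L'\u00e0': case L'\u00e1': case L'\u00e2': return L'a';
        case L'\u00e8': case L'\u00e9': case L'\u00ea': return L'e';
        case L'\u00c0': case L'\u00c1': case L'\u00c2': return L'A';
        case L'\u00c8': case L'\u00c9': case L'\u00ca': return L'E';
        default:        return c;
    }
}

// Fold every character of the text (works on a copy).
std::wstring fold_accents(std::wstring text) {
    for (wchar_t& c : text) {
        c = fold_accent(c);
    }
    return text;
}

// Match '[a-z]+' case-insensitively against the accent-folded text.
bool matches_letters(const std::wstring& text) {
    static const std::wregex letters(L"[a-z]+", std::regex::icase);
    return std::regex_match(fold_accents(text), letters);
}
```

Here `matches_letters(L"Ann\u00e9es")` succeeds because the text is folded to "Annees" before the range is tested. A custom traits class would aim to do the equivalent folding inside the regex engine, so that match positions still refer to the original text.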
--
Ticket URL: <https://svn.boost.org/trac/boost/ticket/8304#comment:4>
Boost C++ Libraries <http://www.boost.org/>
Boost provides free peer-reviewed portable C++ source libraries.
This archive was generated by hypermail 2.1.7 : 2017-02-16 18:50:12 UTC