Subject: [Boost-bugs] [Boost C++ Libraries] #12076: A couple issues matching with unicode regular expressions (word delimiters, brackets)
From: Boost C++ Libraries (noreply_at_[hidden])
Date: 2016-03-19 13:57:36
#12076: A couple issues matching with unicode regular expressions (word delimiters,
brackets)
------------------------------+-------------------------
Reporter: anonymous | Owner: johnmaddock
Type: Bugs | Status: new
Milestone: To Be Determined | Component: regex
Version: Boost 1.61.0 | Severity: Problem
Keywords: |
------------------------------+-------------------------
Hi,
The [https://github.com/mawww/kakoune/ kakoune] code editor uses
boost-regex to search through a file using a regular expression, and
I've stumbled upon some issues which I think are related to how Boost
handles Unicode codepoints.
The syntax used is the Perl one.
First, the `\b` word delimiter doesn't seem to work when Unicode
characters are involved: some strings that should match do not, e.g.
"abcâ 123" with the pattern "â\b".
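For illustration, here is a minimal sketch of how this symptom can arise in an engine that works on raw bytes. It uses `std::regex` as a stand-in, not Boost.Regex, and `byte_search` is a hypothetical helper name. Whether Boost fails for the same reason is an assumption. In UTF-8, "â" (U+00E2) is the byte pair 0xC3 0xA2, and in the default "C" locale the byte 0xA2 is normally not classified as a word character, so no word boundary is seen between it and the following space.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` matches anywhere in `text`.
// Both arguments are treated as raw bytes (std::regex over char),
// which is an assumption about how the editor feeds the engine.
bool byte_search(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);  // default ECMAScript syntax
    return std::regex_search(text, re);
}
```

With this helper, `byte_search("abc\xC3\xA2 123", "\xC3\xA2\\b")` is expected to return false on such implementations, while the ASCII-only `byte_search("abc 123", "c\\b")` returns true.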
Secondly, using the "." pattern on strings that contain Unicode
characters seems to select bytes rather than entire codepoints, e.g. "â"
with the pattern "." will select two bytes.
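One possible reading of this behaviour can be sketched as follows, again with `std::regex` as a stand-in rather than Boost.Regex. All helper names are hypothetical. The decoder assumes well-formed UTF-8 and a `wchar_t` wide enough to hold each decoded code point. The sketch counts matches of "." once over the raw bytes and once over decoded code points.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: decode UTF-8 into one wchar_t per code point.
// Assumes well-formed input; the only check is a length assertion.
std::wstring decode_utf8(const std::string& utf8) {
    std::wstring out;
    for (std::size_t i = 0; i < utf8.size();) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t len = 1;
        char32_t cp = lead;
        if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }
        assert(i + len <= utf8.size());  // truncated sequence otherwise
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        out.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return out;
}

// Number of matches of "." when the engine sees the text byte by byte.
std::size_t dot_matches_bytes(const std::string& utf8) {
    static const std::regex dot(".");
    return static_cast<std::size_t>(std::distance(
        std::sregex_iterator(utf8.begin(), utf8.end(), dot),
        std::sregex_iterator()));
}

// Number of matches of "." when the engine sees decoded code points.
std::size_t dot_matches_code_points(const std::string& utf8) {
    static const std::wregex dot(L".");
    const std::wstring wide = decode_utf8(utf8);
    return static_cast<std::size_t>(std::distance(
        std::wsregex_iterator(wide.begin(), wide.end(), dot),
        std::wsregex_iterator()));
}
```

For the UTF-8 string "\xC3\xA2" ("â"), `dot_matches_bytes` yields 2 and `dot_matches_code_points` yields 1, which is consistent with "." selecting two bytes instead of one codepoint.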
Finally, using brackets around Unicode characters does not work, for
example "[ââ]". This issue is probably related to the one above.
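The suspected link to the previous issue can be sketched like this, under the same assumptions (`std::regex` over bytes as a stand-in for the real engine, hypothetical helper name). In a byte-oriented engine, a bracket expression containing "â" becomes a set of the individual bytes 0xC3 and 0xA2, so it matches one byte at a time and never the two-byte character as a unit.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: length in bytes of the first match of `pattern`
// in `text`, or 0 if nothing matches. Both are treated as raw bytes.
std::size_t first_match_length(const std::string& text,
                               const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, re))
        return 0;
    return static_cast<std::size_t>(m.length(0));
}
```

Here `first_match_length("\xC3\xA2", "[\xC3\xA2]")` gives 1 rather than 2: only the lead byte of "â" is matched. The same set also matches the lead byte of an unrelated character such as "é" (0xC3 0xA9), so `first_match_length("\xC3\xA9", "[\xC3\xA2]")` is 1 as well.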
I have had a look at the documentation, namely the
[http://www.boost.org/doc/libs/1_60_0/libs/regex/doc/html/boost_regex/unicode.html
Unicode & boost.regex] /
[http://www.boost.org/doc/libs/1_60_0/libs/regex/doc/html/boost_regex/syntax/character_classes/optional_char_class_names.html
Character classes supported by Unicode regular expressions] pages, but
I'm not sure if they are related to the issues above (please let me know
if I missed something).
Thanks.
--
Ticket URL: <https://svn.boost.org/trac/boost/ticket/12076>
Boost C++ Libraries <http://www.boost.org/>
Boost provides free peer-reviewed portable C++ source libraries.