Subject: [Boost-bugs] [Boost C++ Libraries] #12076: A couple issues matching with unicode regular expressions (word delimiters, brackets)
From: Boost C++ Libraries (noreply_at_[hidden])
Date: 2016-03-19 13:57:36
------------------------------+-------------------------
  Reporter:  anonymous         |      Owner:  johnmaddock
      Type:  Bugs              |     Status:  new
 Milestone:  To Be Determined  |  Component:  regex
   Version:  Boost 1.61.0      |   Severity:  Problem
  Keywords:                    |
------------------------------+-------------------------
Hi,
The [https://github.com/mawww/kakoune/ kakoune] code editor uses boost-regex
to search through a file with a regular expression, and I've stumbled upon
some issues that I think are related to how Boost handles Unicode code
points.
The syntax used is the Perl one.
First, the `\b` word delimiter doesn't seem to work when Unicode characters
are involved: some strings that should match do not, e.g. "abc†123"
with the pattern "†\b".
Secondly, using the "." pattern on strings that contain Unicode seems to
select bytes rather than entire code points, e.g. matching "†" with the
pattern "." selects two bytes.
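A minimal sketch of this second symptom, not kakoune's or Boost's actual code: it uses `std::regex` over `char` as a stand-in for a byte-oriented engine and assumes the character is U+2020 (†) encoded as UTF-8. The names `first_match_bytes` and `kDagger` are hypothetical. How many code units get consumed depends on the engine's character type; the point is only that "." consumes a single code unit, which is shorter than the code point.

```cpp
#include <cassert>   // only needed if you check the results with assert()
#include <cstddef>
#include <regex>
#include <string>

// Length in bytes of the first match of `pattern` in `text`, or 0 if none.
// std::regex over char treats every byte as one "character".
std::size_t first_match_bytes(const std::string& text,
                              const std::string& pattern) {
    std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, re)) {
        return 0;
    }
    return static_cast<std::size_t>(m.length(0));
}

// Assumed input: UTF-8 encoding of U+2020 DAGGER, three bytes (E2 80 A0).
const std::string kDagger = "\xE2\x80\xA0";
```

Comparing `first_match_bytes(kDagger, ".")` with `kDagger.size()` shows the match stopping inside the encoded character. The Unicode page of the Boost.Regex documentation describes `boost::u32regex` (from `boost/regex/icu.hpp`, which needs an ICU-enabled build) as the code-point-aware alternative; whether kakoune can use it is not something this sketch covers.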
Finally, using brackets around Unicode characters does not work, for
example "[†“]". This issue is probably related to the one above.
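The likely connection between the bracket issue and the "." issue can be sketched as follows, again with `std::regex` over `char` as a stand-in for a byte-oriented engine rather than Boost.Regex itself. The bracket contents are assumed to be U+2020 (†) and U+201C (“) in UTF-8; `first_match`, `kSet` and `kText` are hypothetical names. In such an engine the bracket expression becomes a set of individual bytes, so it matches one byte of an encoded character instead of the whole character.

```cpp
#include <cassert>   // only needed if you check the results with assert()
#include <regex>
#include <string>

// Bytes covered by the first match of `pattern` in `text` (empty if no match).
std::string first_match(const std::string& text, const std::string& pattern) {
    std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}

// Assumed characters, UTF-8 encoded: U+2020 (E2 80 A0) and U+201C (E2 80 9C).
// Seen byte by byte, the set is really {E2, 80, A0, 9C}.
const std::string kSet = "[\xE2\x80\xA0\xE2\x80\x9C]";

// "abc" + U+2020 + "123"; the literal is split so "\xA0" does not absorb "1".
const std::string kText = "abc\xE2\x80\xA0" "123";
```

With these definitions, `first_match(kText, kSet)` returns only the lead byte of the encoded character, and the same set also matches a stray continuation byte such as `"\x80"` on its own, which is not what a reader of the pattern would expect.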
I have had a look at the documentation, namely the
[http://www.boost.org/doc/libs/1_60_0/libs/regex/doc/html/boost_regex/unicode.html
Unicode & boost.regex] /
[http://www.boost.org/doc/libs/1_60_0/libs/regex/doc/html/boost_regex/syntax/character_classes/optional_char_class_names.html
Character classes supported by Unicode regular expressions] pages, but
I'm not sure if they are related to the issues above (please let me know
if I missed something).
Thanks.
--
Ticket URL: <https://svn.boost.org/trac/boost/ticket/12076>
Boost C++ Libraries <http://www.boost.org/>
Boost provides free peer-reviewed portable C++ source libraries.