Boost Users :

Subject: Re: [Boost-users] Boost Regex: Use boost::uint32_t as charactertype.
From: John Maddock (john_at_[hidden])
Date: 2009-07-14 06:17:35


>>> I'd just like to know whether you can use a std::vector<boost::uint32_t>
>>> as a source to match regular expressions against.
>>
>> Yes, but not right out of the box: you would need to provide a traits
>> class so that regex_traits<uint32_t> knows how to interpret uint32_t
>> values as characters.
>>
>> What precisely did you want to do?
>>
>
> Convert UTF-8/UTF-16 to uint32_t, then use regular expressions as a
> means to parse XML.

If you don't mind depending upon ICU then the regex ICU wrappers will do
that for you, *and* let you operate directly on the UTF-8 byte stream as
well:
http://www.boost.org/doc/libs/1_39_0/libs/regex/doc/html/boost_regex/ref/non_std_strings/icu.html.

However, ICU is a big library to depend upon :-(

A more lightweight alternative, if you don't need true Unicode character
classification and case conversion, would be to implement a small traits
class for basic_regex whose members either "do nothing" or forward to the
same methods in regex_traits<char> etc.; see:
http://www.boost.org/doc/libs/1_39_0/libs/regex/doc/html/boost_regex/ref/concepts/traits_concept.html
This is obviously more work, but it reduces the code footprint. Your call :-)
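
As a rough illustration of the "does nothing or forwards" idea, here is a
minimal sketch in plain C++ with no Boost dependency. The struct name
ascii_forwarding_traits_sketch and the is_ascii helper are hypothetical, and
the sketch covers only a few of the member names listed in the traits concept
(length, translate, translate_nocase, value). A class that basic_regex would
actually accept has to provide everything the concept page requires. It also
assumes that treating code points outside ASCII as opaque, caseless values is
acceptable for your input:

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>

// Hypothetical name, not part of Boost. Illustrates the forwarding idea for
// a handful of members only; it is NOT a complete traits class.
struct ascii_forwarding_traits_sketch
{
    typedef std::uint32_t char_type;

    // Assumption: only 7-bit ASCII code points are handed to the
    // narrow-character functions; everything else is left untouched.
    static bool is_ascii(char_type c) { return c < 0x80u; }

    // Number of code points before the terminating zero.
    static std::size_t length(const char_type* p)
    {
        std::size_t n = 0;
        while (p[n] != 0) ++n;
        return n;
    }

    // "Does nothing": case-sensitive matching compares code points as-is.
    char_type translate(char_type c) const { return c; }

    // Forwards ASCII to std::tolower; other code points pass through.
    char_type translate_nocase(char_type c) const
    {
        if (!is_ascii(c)) return c;
        return static_cast<char_type>(std::tolower(static_cast<int>(c)));
    }

    // Value of c as a digit in the given radix, or -1 if it is not one.
    // Only ASCII digits and letters a-f / A-F are recognised here.
    int value(char_type c, int radix) const
    {
        int v = -1;
        if (c >= 0x30u && c <= 0x39u)      v = static_cast<int>(c - 0x30u);      // '0'..'9'
        else if (c >= 0x61u && c <= 0x66u) v = static_cast<int>(c - 0x61u) + 10; // 'a'..'f'
        else if (c >= 0x41u && c <= 0x46u) v = static_cast<int>(c - 0x41u) + 10; // 'A'..'F'
        return (v >= 0 && v < radix) ? v : -1;
    }
};
```

With this design, case-insensitive matching only folds ASCII letters, and
non-ASCII digits are not treated as digits. That is the trade-off against
the ICU route: much less code, but no real Unicode semantics.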

HTH, John.

