Boost Users :

From: John Maddock (john_at_[hidden])
Date: 2005-07-20 05:05:26


> I also saw a post http://lists.boost.org/boost-users/2003/09/5095.php
> where John answered that it
> is better to convert these character sequences on-the-fly to char. Somehow
> I don't like this
> approach, since I believe that with wrong encoding set on the system some
> information might get
> lost.
>
> Is it possible to use XMLCh as character traits in the regular expression
> if XMLCh* points to a
> null-terminated 2 bytes character sequence?

There are several options:

1) Convert the characters on the fly to *wchar_t* and use boost::wregex. This
is a trivial widening of your 16-bit characters, so nothing will get lost.
You could probably use transform_iterator for such a task.

2) In Boost 1.33 there will be more [optional] support for Unicode, but it
requires that you use the ICU library
(http://www.ibm.com/software/globalization/icu/) to provide some of the
basics. You can then correctly scan 16-bit Unicode code sequences, have
surrogate pairs handled correctly, and use the Unicode property names in
regexes, etc. However, the character type for 16-bit code units is either
unsigned short or wchar_t depending upon the platform (this is a requirement
for interoperability with ICU), so you may have to fiddle with your XMLCh
setup to get everything working smoothly. See
http://cvs.sourceforge.net/viewcvs.py/*checkout*/boost/boost/libs/regex/doc/icu_strings.html?rev=1.3

3) You could define your own regex traits class for the character type that
you're using: if you go down this road then make sure that you start with
Boost-1.33 as it has better docs in this area, as well as redesigned traits
class requirements compared to 1.32.
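
As a rough illustration of option 1, here is a minimal sketch of the widening
step. It makes several assumptions: XMLCh is shown as a plain unsigned short
typedef (the real one comes from your XML library's headers), the helper names
widen and contains_match are made up for this example, and std::wregex stands
in for boost::wregex so that the sketch needs only the standard library. The
boost::wregex and boost::regex_search calls have the same shape. Note that this
version copies the text into a std::wstring up front instead of converting on
the fly through transform_iterator, which would avoid the copy.

```cpp
#include <regex>
#include <string>

// Assumption: a 16-bit code unit type standing in for the library's XMLCh.
typedef unsigned short XMLCh;

// Hypothetical helper: widen a null-terminated XMLCh sequence to wchar_t.
// Each 16-bit value is copied unchanged, so nothing is lost as long as
// wchar_t is at least 16 bits wide (true on the usual platforms).
std::wstring widen(const XMLCh* s)
{
    std::wstring out;
    for (; *s != 0; ++s)
        out.push_back(static_cast<wchar_t>(*s));
    return out;
}

// Hypothetical helper: true if the pattern matches anywhere in the text.
// std::wregex is used here in place of boost::wregex.
bool contains_match(const XMLCh* text, const std::wstring& pattern)
{
    const std::wstring wide = widen(text);
    const std::wregex re(pattern);
    return std::regex_search(wide, re);
}
```

Surrogate pairs are passed through as two separate code units here, so a
character outside the BMP is seen by the regex as two characters; that is the
case option 2 is meant to handle.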

Hope this helps,

John.


Boost-users list run by williamkempf at hotmail.com, kalb at libertysoft.com, bjorn.karlsson at readsoft.com, gregod at cs.rpi.edu, wekempf at cox.net