Boost Users :
From: Eric Niebler (eric_at_[hidden])
Date: 2008-08-25 16:56:54
Allan Odgaard wrote:
> It looks like the traits aspect of Xpressive is geared toward
> characters, so I assume that Xpressive is not directly usable with UTF-8
> encoded text, am I correct?
Correct.
> It might work by having the character type be a 32 bit integer and then
> use iterator adapters which expose the sequence as ucs-4 code points
> (after all, the sequence is encoded),
Right, and such iterator adaptors already exist in
boost/regex/pending/unicode_iterator.hpp. I've never tried to use them
with xpressive, however.
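
To make the idea concrete, here is a minimal, hand-rolled sketch of the decoding step such an adaptor performs: turning UTF-8 bytes into 32-bit code points. It deliberately does not use the Boost header mentioned above, and `decode_utf8` is a hypothetical helper name, not part of Boost or xpressive. It assumes well-formed input apart from the basic checks shown and does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a Boost or xpressive API): decode a UTF-8 byte
// string into a sequence of 32-bit code points.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and adds 6 bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char byte = static_cast<unsigned char>(in[i + k]);
            if ((byte & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (byte & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

An iterator adaptor does the same work lazily, one code point per dereference, instead of building a whole `std::u32string` up front. Whether a regex engine accepts the resulting sequence still depends on its traits supporting a 32-bit character type.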
> but that leads me to the next
> question: diacritics.
>
> For example something like é in decomposed unicode is two code points (e
> followed by a combining ´ mark), so even when the sequence is iterated
> as ucs-4 code points, a regexp of . will match just the e, not the
> actual (rendered) character.
I'm afraid your analysis is correct.
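
As a rough illustration of the problem, the sketch below groups a code point sequence so that combining marks stay attached to the preceding base character. It is a simplification under stated assumptions: it only treats the Combining Diacritical Marks block (U+0300 to U+036F) as combining, whereas real grapheme cluster segmentation has many more rules. The names `is_combining_mark` and `split_clusters` are hypothetical and belong to neither xpressive nor Boost.

```cpp
#include <string>
#include <vector>

// Simplifying assumption: only U+0300..U+036F count as combining marks.
bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical helper: split code points into "rendered character" groups,
// each a base code point followed by any combining marks that trail it.
std::vector<std::u32string> split_clusters(const std::u32string& text) {
    std::vector<std::u32string> clusters;
    for (char32_t cp : text) {
        if (!clusters.empty() && is_combining_mark(cp)) {
            clusters.back().push_back(cp);                // attach mark to its base
        } else {
            clusters.push_back(std::u32string(1, cp));    // start a new group
        }
    }
    return clusters;
}
```

For the decomposed é (U+0065 followed by U+0301), this yields one group of two code points, while a matcher that steps one code point at a time sees the e and the mark as separate units.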
> Since I was unable to find any discussion of this while searching for
> Xpressive, I am curious to hear if any thoughts have gone into these
> issues.
Xpressive is not Unicode-aware. It's been on my ToDo list forever, but
it's a huge job and I don't foresee myself having the time to devote to
this in the near future. If you could make a prioritized list of the
features you'd like, it would help.
--
Eric Niebler
BoostPro Computing
http://www.boostpro.com