Boost Users :

From: Eric Niebler (eric_at_[hidden])
Date: 2008-08-25 16:56:54


Allan Odgaard wrote:
> It looks like the traits aspect of Xpressive is geared toward
> characters, so I assume that Xpressive is not directly usable with UTF-8
> encoded text, am I correct?

Correct.

> It might work by having the character type be a 32 bit integer and then
> use iterator adapters which expose the sequence as ucs-4 code points
> (after all, the sequence is “encoded”),

Right, and such iterator adaptors already exist in
boost/regex/pending/unicode_iterator.hpp. I've never tried to use them
with xpressive, however.
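
To illustrate what "exposing the sequence as UCS-4 code points" means, here is a minimal standalone sketch in standard C++. It is not the Boost adaptor itself: decode_utf8 and check_decode_utf8 are hypothetical names, and the decoder does no validation beyond lead and continuation bytes (it does not reject overlong forms or surrogates).

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost): decode a UTF-8 byte string
// into a sequence of UCS-4 code points.
std::vector<char32_t> decode_utf8(const std::string& bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Hypothetical usage check: precomposed U+00E9 is two bytes, one code point.
void check_decode_utf8() {
    const std::vector<char32_t> cps = decode_utf8("\xC3\xA9");
    assert(cps.size() == 1);
    assert(cps[0] == 0xE9);
}
```

An iterator adaptor does the same decoding lazily, one code point per dereference, so a regex engine instantiated on a 32-bit character type could in principle walk the UTF-8 buffer without a copy.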

> but that leads me to the next
> question: diacritics.
>
> For example something like é in decomposed unicode is two code points (e
> followed by a combining ´ mark), so even when the sequence is iterated
> as ucs-4 code points, a regexp of “.” will match just the e, not the
> actual (rendered) character.

I'm afraid your analysis is correct.
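
The effect can be reproduced with any regex engine that works on one code unit at a time. The sketch below uses std::wregex from the standard library as a stand-in, not xpressive, so treat it as an illustration of the problem rather than of xpressive's behaviour. The names first_dot_match_length and check_dot_on_decomposed are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: length, in code units, of the first match of "."
// in text, or 0 if nothing matches.
std::size_t first_dot_match_length(const std::wstring& text) {
    static const std::wregex any_char(L".");
    std::wsmatch m;
    if (!std::regex_search(text, m, any_char))
        return 0;
    return static_cast<std::size_t>(m.length(0));
}

// Hypothetical usage check: decomposed e-acute is 'e' (U+0065) followed by
// the combining acute accent (U+0301), i.e. two code units.
void check_dot_on_decomposed() {
    const std::wstring decomposed = L"e\u0301";
    assert(decomposed.size() == 2);
    // "." consumes only the base letter and leaves the combining mark behind.
    assert(first_dot_match_length(decomposed) == 1);
}
```

Matching the rendered character as a unit would need the engine to treat a base character plus its following combining marks as one item, which is the kind of Unicode awareness discussed in this thread.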

> Since I was unable to find any discussion of this while searching for
> Xpressive, I am curious to hear if any thoughts have gone into these
> issues.

Xpressive is not Unicode-aware. It's been on my ToDo list forever, but
it's a huge job and I don't foresee myself having the time to devote to
this in the near future. If you could make a prioritized list of the
features you'd like, it would help.

-- 
Eric Niebler
BoostPro Computing
http://www.boostpro.com
