Boost Users :


From: John Maddock (john_at_[hidden])
Date: 2005-07-20 12:02:25

Next message: Jonathan Turkanis: "Re: [Boost-users] regex with multi-byte characters"
Previous message: Chris Coleman: "Re: [Boost-users] 1.33 pointer to object serialization"
In reply to: Ovanes Markarian: "Re: [Boost-users] regex with multi-byte characters"
Next in thread: Ovanes Markarian: "Re: [Boost-users] regex with multi-byte characters"
Reply: Ovanes Markarian: "Re: [Boost-users] regex with multi-byte characters"

>> There are several options:
>>
>> 1) Convert the characters on the fly to *wchar_t* and use boost::wregex,
>> it's a trivial widening of your 16-bit characters, so nothing will get
>> lost.
>> You could probably use transform_iterator for such a task.
> That's possible, the only problem is that *wchar_t* is not always 2 bytes
> long. At least that is what I read on the Xerces-C Build Instructions page at
> http://xml.apache.org/xerces-c/build-misc.html (What
> should I define XMLCh to be?). Here is an excerpt:

Hey, stop right there! I said use an adapter, not a cast:

template <class Iterator>
struct my_adapter
{
  my_adapter(Iterator p) : m_position(p){}
  // Dereferencing converts each character to wchar_t by value (no cast of the buffer).
  wchar_t operator*()const { return *m_position; }
  my_adapter& operator++() { ++m_position; return *this; }

// other members to make this a valid iterator go here...

private:
  Iterator m_position;
};

Then pass my_adapter as the iterator type to the regex algorithms, rather
than XMLCh*, for example:

bool is_regex_present(XMLCh const* p, int len, boost::wregex const& e)
{
  // The adapter must wrap a pointer-to-const, since p is XMLCh const*.
  my_adapter<XMLCh const*> i(p), j(p+len);
  return boost::regex_search(i, j, e);
}
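
For reference, here is a fuller sketch of what "other members to make this a valid iterator" might look like. It is only an illustration under some assumptions: it uses std::wregex and std::regex_search from the standard <regex> header as stand-ins for boost::wregex and boost::regex_search so that it builds without Boost, and the names XMLCh16 and widening_iterator are hypothetical (XMLCh16 stands in for a 16-bit XMLCh). The static_assert records the assumption that wchar_t is at least as wide as the source character type, which is what makes the conversion lossless whether wchar_t is 2 or 4 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>

// Hypothetical stand-in for Xerces' XMLCh (assumed to be a 16-bit unsigned type).
typedef unsigned short XMLCh16;

// Assumption: wchar_t is at least as wide as the source character type,
// so converting each element cannot lose information.
static_assert(sizeof(wchar_t) >= sizeof(XMLCh16), "wchar_t is narrower than XMLCh16");

// Bidirectional iterator adapter that yields wchar_t by value.
template <class Iterator>
class widening_iterator
{
public:
    typedef std::bidirectional_iterator_tag iterator_category;
    typedef wchar_t                         value_type;
    typedef std::ptrdiff_t                  difference_type;
    typedef const wchar_t*                  pointer;
    typedef wchar_t                         reference; // elements are returned by value

    // Regex match results store iterators, so a default constructor is needed.
    widening_iterator() : m_position() {}
    explicit widening_iterator(Iterator p) : m_position(p) {}

    // Convert the underlying character on each dereference.
    wchar_t operator*() const { return static_cast<wchar_t>(*m_position); }

    widening_iterator& operator++() { ++m_position; return *this; }
    widening_iterator operator++(int) { widening_iterator old(*this); ++m_position; return old; }
    widening_iterator& operator--() { --m_position; return *this; }
    widening_iterator operator--(int) { widening_iterator old(*this); --m_position; return old; }

    friend bool operator==(const widening_iterator& a, const widening_iterator& b)
    { return a.m_position == b.m_position; }
    friend bool operator!=(const widening_iterator& a, const widening_iterator& b)
    { return !(a == b); }

private:
    Iterator m_position;
};

// Same shape as the function above, with the standard library standing in for Boost.
bool is_regex_present(const XMLCh16* p, std::size_t len, const std::wregex& e)
{
    assert(p != nullptr || len == 0);
    widening_iterator<const XMLCh16*> i(p), j(p + len);
    return std::regex_search(i, j, e);
}
```

The iterator never copies or reinterprets the buffer, it only converts one character at a time as the regex engine reads it. Swapping the std:: names back to their boost:: counterparts should give the Boost version, though I have not verified this sketch against Boost itself.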

>> 2) In Boost 1.33 there will be more [optional] support for Unicode, but it
>> requires that you use the ICU library
>> (http://www.ibm.com/software/globalization/icu/) to provide some of the
>> basics. You can then correctly scan 16-bit Unicode code sequences, and have
>> surrogate pairs correctly handled, as well as have access to the Unicode
>> property names in regexes etc. However the character type for 16-bit code
>> points is either unsigned short or wchar_t depending upon the platform (this
>> is a requirement for interoperability with ICU), so you may have to fiddle
>> with your XMLCh setup to get everything working smoothly. See
>> http://cvs.sourceforge.net/viewcvs.py/*checkout*/boost/boost/libs/regex/doc/icu_strings.html?rev=1.3
>>
> Ok, I understand. But then I may need to make conversions again
> (depending on the platform).
> Maybe it would be better to offer an independent way of handling
> characters, as in the third possibility you have already mentioned.

Actually probably not: Xerces can be built with ICU support (see
http://xml.apache.org/xerces-c/build-misc.html#ICUPerl), so if you define
XMLCh to be the same type as ICU's UChar data type, then no conversions are
required.

>> 3) You could define your own regex traits class for the character type that
>> you're using: if you go down this road then make sure that you start with
>> Boost-1.33 as it has better docs in this area, as well as redesigned traits
>> class requirements compared to 1.32.
> Can I read more about it? Can you point me to a document which describes
> the traits class? What are the key points of this class? I tried to take a
> look at the sources, but it was hard to understand what is what, since there
> are not many comments and a lot of typedefs which are hard to trace back.

The traits class requirements are here:
http://cvs.sourceforge.net/viewcvs.py/*checkout*/boost/boost/libs/regex/doc/concepts.html#traits.

I should warn you it's still quite a bit of work to support a new character
type.

John.

