Boost Users :
From: Ovanes Markarian (om_boost_at_[hidden])
Date: 2005-07-20 13:29:21
Ok, thanks for the answer.
What do you think? Could Boost.Regex make use of such a traits class, or would you prefer not to
include it in the distribution?
On Wed, July 20, 2005 19:02, John Maddock said:
>>> There are several options:
>>>
>>> 1) Convert the characters on the fly to *wchar_t* and use boost::wregex,
>>> it's a trivial widening of your 16-bit characters, so nothing will get
>>> lost.
>>> You could probably use transform_iterator for such a task.
>> That's possible; the only problem is that *wchar_t* is not always 2 bytes
>> long. At least that is what I read
>> on the Xerces-C Build Instructions page at
>> http://xml.apache.org/xerces-c/build-misc.html (What
>> should I define XMLCh to be?). Here is an excerpt:
>
> Hey, stop right there! I said use an adapter, not a cast:
>
> template <class Iterator>
> struct my_adapter
> {
> my_adapter(Iterator p) : m_position(p){}
> wchar_t operator*()const { return *m_position; }
> my_adapter& operator++() { m_position++; return *this; }
>
> // other members to make this a valid iterator go here...
>
> private:
> Iterator m_position; // the wrapped iterator
> };
>
> Then pass my_adapter's as the iterator type to the regex algorithms, rather
> than a XMLCh*, for example:
>
> bool is_regex_present(XMLCh const* p, int len, boost::wregex const& e)
> {
> my_adapter<XMLCh const*> i(p), j(p+len);
> return boost::regex_search(i, j, e);
> }
>
I was not going to use a cast. What I meant was this: even if I use *wchar_t*, I
still need a platform-dependent conversion from one character type to another, since *wchar_t* is
platform dependent.
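
For reference, the adapter idea quoted above can be fleshed out into a complete bidirectional iterator. This is only a minimal sketch under stated assumptions: Char16 is a hypothetical stand-in for XMLCh (whose real type depends on the Xerces-C build), widening_iterator is an illustrative name, and std::wregex from the C++11 standard library is used in place of boost::wregex so that the example compiles without Boost. The widening is a per-element value conversion, so it should not depend on whether wchar_t is 2 or 4 bytes, provided wchar_t is at least 16 bits wide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <regex>

// Hypothetical stand-in for Xerces-C's XMLCh (assumption: a 16-bit unsigned type).
typedef std::uint16_t Char16;

// Bidirectional iterator that widens each 16-bit unit to wchar_t when dereferenced.
template <class Iterator>
class widening_iterator
{
public:
    typedef std::bidirectional_iterator_tag iterator_category;
    typedef wchar_t                         value_type;
    typedef std::ptrdiff_t                  difference_type;
    typedef const wchar_t*                  pointer;
    typedef wchar_t                         reference; // returned by value

    widening_iterator() : m_position() {}
    explicit widening_iterator(Iterator p) : m_position(p) {}

    // Value conversion, not a reinterpretation of the underlying memory.
    wchar_t operator*() const { return static_cast<wchar_t>(*m_position); }

    widening_iterator& operator++() { ++m_position; return *this; }
    widening_iterator operator++(int) { widening_iterator t(*this); ++m_position; return t; }
    widening_iterator& operator--() { --m_position; return *this; }
    widening_iterator operator--(int) { widening_iterator t(*this); --m_position; return t; }

    friend bool operator==(const widening_iterator& a, const widening_iterator& b)
    { return a.m_position == b.m_position; }
    friend bool operator!=(const widening_iterator& a, const widening_iterator& b)
    { return !(a == b); }

private:
    Iterator m_position; // the wrapped iterator
};

// Searches a 16-bit buffer with a wide-character regex, without copying the buffer.
bool is_regex_present(const Char16* p, std::size_t len, const std::wregex& e)
{
    assert(p != nullptr || len == 0);
    widening_iterator<const Char16*> first(p), last(p + len);
    return std::regex_search(first, last, e);
}
```

Note that this treats every 16-bit unit as one character, so surrogate pairs are not combined into a single code point.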
>>> 2) In Boost 1.33 there will be more [optional] support for Unicode, but
>>> it
>>> requires that you use the ICU library
>>> (http://www.ibm.com/software/globalization/icu/) to provide some of the
>>> basics. You can then correctly scan 16-bit Unicode code sequences, and
>>> have
>>> surrogate pairs correctly handled, as well as have access to the Unicode
>>> property names in regexes etc. However the character type for 16-bit
>>> code
>>> points is either unsigned short or wchar_t depending upon the platform
>>> (this
>>> is a requirement for interoperability with ICU), so you may have to fiddle
>>> with your XMLCh setup to get everything working smoothly. See
>>> http://cvs.sourceforge.net/viewcvs.py/*checkout*/boost/boost/libs/regex/doc/icu_strings.html?rev=1.3
>>>
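
To illustrate what handling surrogate pairs correctly involves, here is a minimal sketch of combining UTF-16 code units into code points. It is not how ICU or Boost.Regex implement it; decode_utf16 is a hypothetical helper name, and replacing an unpaired surrogate with U+FFFD is a policy assumed here for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Decodes a buffer of UTF-16 code units into Unicode code points.
// A high surrogate (D800..DBFF) followed by a low surrogate (DC00..DFFF)
// becomes one code point at or above U+10000.
std::vector<std::uint32_t> decode_utf16(const std::uint16_t* p, std::size_t len)
{
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < len; ++i) {
        std::uint32_t u = p[i];
        const bool is_high = (u >= 0xD800 && u <= 0xDBFF);
        if (is_high && i + 1 < len && p[i + 1] >= 0xDC00 && p[i + 1] <= 0xDFFF) {
            const std::uint32_t low = p[i + 1];
            u = 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00);
            ++i; // the low surrogate has been consumed as well
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            u = 0xFFFD; // unpaired surrogate: substitute the replacement character
        }
        out.push_back(u);
    }
    return out;
}
```

A regex engine that only widens each 16-bit unit would see such a pair as two separate characters, which is the case the ICU-based support is meant to cover.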
>> Ok, I understand. But then I possibly need to make conversions again
>> (dependent on the platform).
>> Maybe it would be better to offer an independent way of handling
>> characters, as in the third possibility you have already
>> mentioned.
>
> Actually probably not: Xerces can be built with ICU support, see:
> http://xml.apache.org/xerces-c/build-misc.html#ICUPerl so if you define
> XMLCh to be the same type as ICU's UChar data type, then no conversions are
> required.
There are too many developers involved in the process for us to force everyone to recompile Xerces-C
with specific settings, so I don't think this is an option for us. In our case it could also lead
to unpredictable results if someone replaces xerces-c with a freshly compiled xerces-c without ICU
support. I am a little sceptical about this.
>
>>> 3) You could define your own regex traits class for the character type
>>> that
>>> you're using: if you go down this road then make sure that you start with
>>> Boost-1.33 as it has better docs in this area, as well as redesigned
>>> traits
>>> class requirements compared to 1.32.
>> Can I read more about it? Can you point me to a document which describes
>> the traits class? What
>> are the key points of this class? I tried to take a look at the
>> sources, but it was hard
>> to understand what is what, since there are not many comments and a lot
>> of typedefs which are
>> hard to trace back.
>
> The traits class requirements are here:
> http://cvs.sourceforge.net/viewcvs.py/*checkout*/boost/boost/libs/regex/doc/concepts.html#traits.
>
> I should warn you it's still quite a bit of work to support a new character
> type.
I think I should give it a try. ;)
>
> John.
>
> _______________________________________________
> Boost-users mailing list
> Boost-users_at_[hidden]
> http://lists.boost.org/mailman/listinfo.cgi/boost-users
>
With Kind Regards,
Ovanes