Boost Users :

From: Ovanes Markarian (om_boost_at_[hidden])
Date: 2005-07-20 06:44:33


John,

many thanks for your answer. I would like to comment on some of your points.

On Wed, July 20, 2005 12:05, John Maddock said:
>> I also saw a post http://lists.boost.org/boost-users/2003/09/5095.php
>> where John answered that it
>> is better to convert these character sequences on-the-fly to char. Somehow
>> I don't like this
>> approach, since I believe that with wrong encoding set on the system some
>> information might get
>> lost.
>>
>> Is it possible to use XMLCh as character traits in the regular expression
>> if XMLCh* points to a
>> null-terminated 2 bytes character sequence?
>
> There are several options:
>
> 1) Convert the characters on the fly to *wchar_t* and use boost::wregex,
> it's a trivial widening of your 16-bit characters, so nothing will get lost.
> You could probably use transform_iterator for such a task.
That's possible; the only problem is that *wchar_t* is not always 2 bytes long. At least that is
what I read on the Xerces-C Build Instructions page at http://xml.apache.org/xerces-c/build-misc.html
(under "What should I define XMLCh to be?"). Here is an excerpt:
...
Unlike XMLCh, the encoding of wchar_t is platform dependent. Sometimes it is utf-16 (AIX,
Windows), sometimes ucs-4 (Solaris, Linux), sometimes it is not based on Unicode at all (HP/UX,
AS/400, system 390).
...
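
To make option 1 concrete, here is a minimal sketch of the widening step. It is an illustration only: XmlChar is a hypothetical stand-in for XMLCh (assumed to be a 16-bit type), widen_units and matches are made-up helper names, and std::wregex is used in place of boost::wregex so that the sketch needs only the standard library. It copies the string instead of using transform_iterator, and it does nothing about surrogate pairs, so on a platform with a 32-bit wchar_t the result is not valid ucs-4 for characters above 64k.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical stand-in for Xerces' XMLCh: assumed to be a 16-bit type.
using XmlChar = char16_t;

// Widen each 16-bit unit of a null-terminated sequence to wchar_t.
// Nothing is lost per unit as long as wchar_t is at least 16 bits wide.
std::wstring widen_units(const XmlChar* s) {
    static_assert(sizeof(wchar_t) >= sizeof(XmlChar), "wchar_t too narrow");
    assert(s != nullptr);
    std::wstring out;
    for (; *s != 0; ++s) {
        out.push_back(static_cast<wchar_t>(*s));
    }
    return out;
}

// Match a widened copy of the text against a wide-character pattern.
// std::wregex is an assumption here; the thread is about boost::wregex.
bool matches(const XmlChar* text, const std::wstring& pattern) {
    const std::wregex re(pattern);
    return std::regex_match(widen_units(text), re);
}
```

For example, matches(u"item42", L"[a-z]+[0-9]+") widens the six 16-bit units and then runs an ordinary wide-character match on the copy.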

In former releases XMLCh was defined as wchar_t, but the Apache developers decided to abandon that
because of:
...
- Portability problems with any code that assumes that the types of XMLCh and wchar_t are compatible
- Excessive memory usage, especially in the DOM, on platforms with 32 bit wchar_t.
- utf-16 encoded XMLCh is not always compatible with ucs-4 encoded wchar_t on Solaris and Linux.
The problem occurs with Unicode characters with values greater than 64k; in ucs-4 the value is
stored as a single 32 bit quantity. With utf-16, the value will be stored as a "surrogate pair" of
two 16 bit values. Even with XMLCh equated to wchar_t, xerces will still create the utf-16 encoded
surrogate pairs, which are illegal in ucs-4 encoded wchar_t strings.
...
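
The surrogate-pair point in the excerpt can be illustrated with a small decoder. This is only a sketch under my own assumptions: utf16_to_ucs4 is a hypothetical helper, not part of Xerces or Boost, and unpaired surrogates are simply passed through unchanged. It shows that turning utf-16 into ucs-4 means combining two 16-bit units into one 32-bit value, which plain per-unit widening does not do.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode utf-16 code units into ucs-4 code points (hypothetical helper).
// A high surrogate (0xD800..0xDBFF) followed by a low surrogate
// (0xDC00..0xDFFF) is combined into one value above 0xFFFF.
std::u32string utf16_to_ucs4(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t unit = in[i];
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        if (is_high && i + 1 < in.size()) {
            const char32_t next = in[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                out.push_back(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
                ++i;  // the low surrogate has been consumed as well
                continue;
            }
        }
        // Ordinary 16-bit character, or an unpaired surrogate kept as-is.
        out.push_back(unit);
    }
    return out;
}
```

The pair 0xD834 0xDD1E, for instance, decodes to the single value 0x1D11E, whereas widening each unit separately would leave two surrogate values in the wchar_t string.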

>
> 2) In Boost 1.33 there will be more [optional] support for Unicode, but it
> requires that you use the ICU library
> (http://www.ibm.com/software/globalization/icu/) to provide some of the
> basics. You can then correctly scan 16-bit Unicode code sequences, and have
> surrogate pairs correctly handled, as well as have access to the Unicode
> property names in regexes etc. However the character type for 16-bit code
> points is either unsigned short or wchar_t depending upon the platform (this
> is a requirement for interoperability with ICU), so you may have to fiddle
> with your XMLCh setup to get everything working smoothly. See
> http://cvs.sourceforge.net/viewcvs.py/*checkout*/boost/boost/libs/regex/doc/icu_strings.html?rev=1.3
>
OK, I understand. But then I may need to convert again, depending on the platform. Maybe it would
be better to have a platform-independent way of handling characters, such as the third possibility
you mention.

> 3) You could define your own regex traits class for the character type that
> you're using: if you go down this road then make sure that you start with
> Boost-1.33 as it has better docs in this area, as well as redesigned traits
> class requirements compared to 1.32.
Can I read more about it? Can you point me to a document that describes the traits class? What
are the key points of this class? I tried to look at the sources, but it was hard to understand
what is what, since there are few comments and a lot of typedefs that are hard to trace back.

I am very thankful for such a nice library and your effort. Many thanks for your time.

>
> Hope this helps,
>
> John.

With Kind Regards,

Ovanes

