Boost logo

Boost :

From: Joel de Guzman (joel_at_[hidden])
Date: 2007-03-08 18:47:46


Vyacheslav E. Andrejev wrote:
> Hello Joel,
>
> JG> No, that does not happen. The signed(ness) is ignored by the
> JG> char-set class. The full 8-bits is mapped to a 256-bit bitset. Do
> JG> you see a need for negative char values?
>
> Unfortunately, it does happen. Look at the definition of two functions. The
> first is boost::spirit::utility::impl::construct_chset (in the file $(BOOST)/boost/spirit/utility/impl/chset.ipp):
>
[snips]
>
> It is obvious that for our input "\x9\xA\xD\x20-\xFF" the function "ptr->set(ch,
> next);" will be invoked with the first argument equals to '\x20', i.e. 32,
> and the second equals to '\xff', i.e. -1. Yes, you right, the implementation
> then will try to map these arguments into a bitset of unsigned values. But
> look at the way it is implemented in boost::spirit::basic_chset_8bit<CharT>::set.
> You can find the implementation of in $(BOOST)/boost/spirit/utility/impl/chset/basic_chset.ipp:
>
> basic_chset_8bit<CharT>::set(CharT from, CharT to)
> {
> for (int i = from; i <= to; ++i)
> bset.set((unsigned char)i);
> }
>
> There will be not a single iteration when from == 32 and to == -1.

You are right!

> To correct this issue I suggest change
>
> Char = chset_t("\x9\xA\xD\x20-\xFF");
>
> to
>
> Char = chset_t("\x9\xA\xD\x20-\x7f\x80-\xFF");
>
> The latter one will work independently of sign'ness of the char type.

Your solution is a good workaround for that specific case. However,
in the interest of correcting this (doing the right thing) for the
whole, and especially for Spirit2 currently in development, the
general question is: should we ignore the signedness of a char?

A simple fix is to cast CharT to unsigned char (the handling
in basic_chset_8bit is inadequate in this regard).

But again, the question is: should we ignore the signedness of a char?
Hartmut and I had a discussion and we think the answer is yes.
Expressions like in the XML parser are the normal expected behavior.
However, we're a bit reluctant to enforce this as this would entail
lots of code changes library-wide and might break existing code
using the library. Perhaps the most sensible thing, for now, is to
apply the fix you gave and document this behavior.

I welcome comments and opinions. I'm cross-posting to the Spirit lists.
Here's Vyacheslav's original post (for context):

Vyacheslav E. Andrejev wrote:
> Hello All,
>
> XML grammar parser in boost::serialization has a definition of
> spirit primitive to deal with XML character set. Initialization
> of this pimitive looks like
> following (file $(BOOST)/libs/serialization/src/xml_grammar.cpp):
>
> Char = chset_t("\x9\xA\xD\x20-\xFF");
>
> Obviously, if char type is signed, then \xFF means -1 and the
> above initialization will be equivalent to
>
> Char = chset_t("\x9\xA\xD");

Thank you!

Regards,

-- 
Joel de Guzman
http://www.boost-consulting.com
http://spirit.sf.net

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk