Re: [Boost-bugs] [Boost C++ Libraries] #1678: Boost.property_tree::read_xml does not parse UNICODE file with BOMs

From: Boost C++ Libraries (noreply_at_[hidden])
Date: 2011-02-18 21:26:23


#1678: Boost.property_tree::read_xml does not parse UNICODE file with BOMs
--------------------------------------+-------------------------------------
  Reporter: tom | Owner: cornedbee
      Type: Patches | Status: assigned
 Milestone: Boost 1.47.0 | Component: property_tree
   Version: Boost Development Trunk | Severity: Showstopper
Resolution: | Keywords: property_tree UNICODE BOM read_xml
--------------------------------------+-------------------------------------

Comment (by cornedbee):

 No. These tests are invalid because of the way the test system works, and
 because of the way PTree's XML support works.

 PTree's XML parser doesn't insulate you from encoding issues. In fact, I
 have an error in my test case in that I specify UTF-16 as the encoding of
 the XML snippet. That's incorrect: the encoding is UTF-8 in the test file
 that is created. And PTree doesn't care anyway, because all PTree does is
 read data from an input stream (wistream in the wchar_t case) and process
 it, assuming that it's in the platform default encoding for this character
 type. The encoding declaration of the XML is completely ignored.

 So the only thing that does encoding conversion is the input stream. The
 test cases install a UTF-8 conversion facet in the global locale, so that
 the wide stream tests expect the input files to contain UTF-8. Any other
 encoding would require replacing the code conversion facet for such tests,
 and wouldn't work at all for the narrow version, because narrow streams
 don't transcode AFAIK.
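
 To illustrate where the transcoding happens, here is a minimal sketch of a
 wide stream that converts UTF-8 on disk into wchar_t. It is not the code
 the test suite uses: it assumes std::codecvt_utf8 from <codecvt>
 (deprecated since C++17 but still widely shipped), it imbues a single
 stream rather than the global locale, and the helper name
 read_utf8_file_as_wide is made up for this example.

```cpp
#include <codecvt>   // std::codecvt_utf8 (assumption: still provided by the toolchain)
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper: read a UTF-8 file into a wide string.
// The only transcoding step is the codecvt facet of the stream's locale.
std::wstring read_utf8_file_as_wide(const std::string& path)
{
    std::wifstream in;
    // Imbue before opening so the file buffer uses the facet from the first byte.
    in.imbue(std::locale(std::locale::classic(),
                         new std::codecvt_utf8<wchar_t>));
    in.open(path.c_str(), std::ios::binary);
    // Returns an empty string if the file could not be opened.
    return std::wstring(std::istreambuf_iterator<wchar_t>(in),
                        std::istreambuf_iterator<wchar_t>());
}
```

 With the facet's default mode, a UTF-8 BOM at the start of the file is not
 consumed by the stream; it arrives in the wide sequence as U+FEFF, which
 is the character the parser then has to deal with.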

 Yes, this is technically invalid handling of XML. But that's a completely
 different issue and has nothing to do with the BOM issue here. That would
 be a much bigger issue: it would mean that the library would have to load
 the file as a binary block, detect the encoding, transcode the data to the
 native encoding for the given character type, and only then actually parse
 the XML.
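
 For a sense of what the "detect the encoding" step alone would involve,
 the sketch below guesses an encoding from a byte order mark in a raw
 buffer. The Encoding enum and detect_bom are hypothetical names, not part
 of PTree, and a real implementation would also have to handle files
 without a BOM and honor the XML encoding declaration.

```cpp
#include <cstddef>
#include <string>

enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Hypothetical helper: guess the encoding of a raw byte buffer from its BOM.
Encoding detect_bom(const std::string& bytes)
{
    auto starts_with = [&bytes](const char* sig, std::size_t n) {
        return bytes.size() >= n && bytes.compare(0, n, sig, n) == 0;
    };
    // UTF-32 is checked before UTF-16 because FF FE 00 00 begins with FF FE.
    if (starts_with("\x00\x00\xFE\xFF", 4)) return Encoding::Utf32BE;
    if (starts_with("\xFF\xFE\x00\x00", 4)) return Encoding::Utf32LE;
    if (starts_with("\xEF\xBB\xBF", 3))     return Encoding::Utf8;
    if (starts_with("\xFE\xFF", 2))         return Encoding::Utf16BE;
    if (starts_with("\xFF\xFE", 2))         return Encoding::Utf16LE;
    return Encoding::Unknown;  // no BOM: the caller would need other heuristics
}
```

 Detection is the small part. The transcoding step that follows is where a
 proper encoding library would be needed.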

 Sorry, but I'm not going to do that. I may do it if Boost ever has a
 usable encoding handling library that I can use, but not before that.

 This bug is about reading UTF-8 files that contain a BOM with a wide-
 character property_tree. I've fixed this bug by correctly skipping the BOM
 for wchar_t sequences under the assumption that the input stream has
 correctly converted whatever was on disk to the native encoding for
 wchar_t, which is further assumed to be native-endian UTF-16/32. That's
 actually a precondition for the XML parser, even though it's probably not
 documented.

 But poor documentation is a separate bug.
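
 The BOM skip described above amounts to roughly the following. This is a
 sketch of the idea, not the committed patch: bom_length is a hypothetical
 name, and it relies on the same precondition, namely that the stream has
 already produced native-endian UTF-16/32 in wchar_t.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of leading characters to skip before parsing.
// Assumes the text is already decoded, so a BOM is the single character U+FEFF.
std::size_t bom_length(const std::wstring& text)
{
    return (!text.empty() && text[0] == static_cast<wchar_t>(0xFEFF)) ? 1 : 0;
}
```

 A caller would start parsing at text.begin() + bom_length(text). If the
 stream delivered the bytes in the wrong byte order, the first character
 would be U+FFFE instead and this check would not match.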

-- 
Ticket URL: <https://svn.boost.org/trac/boost/ticket/1678#comment:8>
Boost C++ Libraries <http://www.boost.org/>
Boost provides free peer-reviewed portable C++ source libraries.