Subject: Re: [Boost-bugs] [Boost C++ Libraries] #1678: Boost.property_tree::read_xml does not parse UNICODE file with BOMs
From: Boost C++ Libraries (noreply_at_[hidden])
Date: 2011-02-18 21:26:23
#1678: Boost.property_tree::read_xml does not parse UNICODE file with BOMs
--------------------------------------+-------------------------------------
Reporter: tom | Owner: cornedbee
Type: Patches | Status: assigned
Milestone: Boost 1.47.0 | Component: property_tree
Version: Boost Development Trunk | Severity: Showstopper
Resolution: | Keywords: property_tree UNICODE BOM read_xml
--------------------------------------+-------------------------------------
Comment (by cornedbee):
No. These tests are invalid because of the way the test system works, and
because of the way PTree's XML support works.
PTree's XML parser doesn't insulate you from encoding issues. In fact, I
have an error in my test case in that I specify UTF-16 as the encoding of
the XML snippet. That's incorrect: the encoding is UTF-8 in the test file
that is created. And PTree doesn't care anyway, because all PTree does is
read data from an input stream (wistream in the wchar_t case) and process
it, assuming that it's in the platform default encoding for this character
type. The encoding declaration of the XML is completely ignored.
So the only thing that does encoding conversion is the input stream. The
test cases install a UTF-8 conversion facet in the global locale, so that
the wide stream tests expect the input files to contain UTF-8. Any other
encoding would require replacing the code conversion facet for such tests,
and wouldn't work at all for the narrow version, because narrow streams
don't transcode AFAIK.
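To illustrate the point about the stream doing all the conversion, here is a minimal sketch of a wide file stream with a UTF-8 conversion facet. It is not code from the ticket or from the test suite: `read_utf8_file_as_wide` is a hypothetical helper, and `std::codecvt_utf8` (deprecated since C++17) stands in for whatever facet the tests actually install. It assumes a platform where `wchar_t` can hold the decoded code points.

```cpp
#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper, not part of Boost.PropertyTree.
// Reads a UTF-8 encoded file through a wide stream. The codecvt facet in the
// stream's locale performs the only transcoding step; nothing here looks at
// the XML encoding declaration.
std::wstring read_utf8_file_as_wide(const std::string& path) {
    std::wifstream in;
    // Imbue before opening so the file buffer uses the facet from the start.
    in.imbue(std::locale(std::locale::classic(),
                         new std::codecvt_utf8<wchar_t>));
    in.open(path.c_str(), std::ios::in | std::ios::binary);
    return std::wstring(std::istreambuf_iterator<wchar_t>(in),
                        std::istreambuf_iterator<wchar_t>());
}
```

With the facet's default mode, a UTF-8 BOM on disk is not consumed; it reaches the reader as a leading U+FEFF wide character, which is the situation this ticket is about.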
Yes, this is technically invalid handling of XML. But that's a completely
different issue and has nothing to do with the BOM issue here. That would
be a much bigger issue: it would mean that the library would have to load
the file as a binary block, detect the encoding, transcode the data to the
native encoding for the given character type, and only then actually parse
the XML.
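For a sense of what the first of those steps involves, the following is a rough sketch of BOM-based encoding detection on a raw byte block. The names `BomEncoding`, `BomInfo` and `detect_bom` are made up for this example. Full XML encoding detection also has to handle files without a BOM and the encoding declaration itself, and the transcoding step is not shown at all.

```cpp
#include <cstddef>
#include <string>

// Hypothetical types for illustration only.
enum class BomEncoding { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

struct BomInfo {
    BomEncoding encoding;  // encoding implied by the BOM, if any
    std::size_t length;    // number of bytes the BOM occupies
};

// Inspects the first bytes of a file loaded as a binary block.
BomInfo detect_bom(const std::string& bytes) {
    auto starts_with = [&bytes](const char* sig, std::size_t n) {
        return bytes.size() >= n && bytes.compare(0, n, sig, n) == 0;
    };
    // The UTF-32 checks come first: the UTF-32 LE BOM (FF FE 00 00)
    // begins with the UTF-16 LE BOM (FF FE).
    if (starts_with("\x00\x00\xFE\xFF", 4)) return {BomEncoding::Utf32BE, 4};
    if (starts_with("\xFF\xFE\x00\x00", 4)) return {BomEncoding::Utf32LE, 4};
    if (starts_with("\xEF\xBB\xBF", 3))     return {BomEncoding::Utf8, 3};
    if (starts_with("\xFE\xFF", 2))         return {BomEncoding::Utf16BE, 2};
    if (starts_with("\xFF\xFE", 2))         return {BomEncoding::Utf16LE, 2};
    return {BomEncoding::None, 0};
}
```

One known ambiguity of this approach: a UTF-16 LE file whose first character after the BOM is U+0000 would be classified as UTF-32 LE, though that cannot occur in well-formed XML.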
Sorry, but I'm not going to do that. I may do it if Boost ever has a
usable encoding handling library that I can use, but not before that.
This bug is about reading UTF-8 files that contain a BOM with a wide-
character property_tree. I've fixed this bug by correctly skipping the BOM
for wchar_t sequences under the assumption that the input stream has
correctly converted whatever was on disk to the native encoding for
wchar_t, which is further assumed to be native-endian UTF-16/32. That's
actually a precondition for the XML parser, even though it's probably not
documented.
But poor documentation is a separate bug.
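The BOM skipping described above can be sketched roughly as follows. This is not the actual patch: `skip_wide_bom` is a hypothetical name, and the sketch relies on the same precondition, namely that the stream has already converted the input to native-endian UTF-16/32 so that a BOM shows up as a single leading U+FEFF.

```cpp
#include <string>

// Hypothetical helper, not the code committed for this ticket.
// Given a range of wchar_t that is already in the native wide encoding,
// returns an iterator past a leading U+FEFF (the BOM), if one is present.
template <typename Iterator>
Iterator skip_wide_bom(Iterator first, Iterator last) {
    if (first != last && *first == static_cast<wchar_t>(0xFEFF)) {
        ++first;  // drop exactly one BOM character, nothing else
    }
    return first;
}
```

A parser would then start from the returned iterator, for example `skip_wide_bom(text.begin(), text.end())` for a `std::wstring text` read from the stream.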
--
Ticket URL: <https://svn.boost.org/trac/boost/ticket/1678#comment:8>
Boost C++ Libraries <http://www.boost.org/>
Boost provides free peer-reviewed portable C++ source libraries.