Re: [Boost-bugs] [Boost C++ Libraries] #8883: property_tree JSON reader does not parse unicode characters properly

Subject: Re: [Boost-bugs] [Boost C++ Libraries] #8883: property_tree JSON reader does not parse unicode characters properly
From: Boost C++ Libraries (noreply_at_[hidden])
Date: 2013-09-04 09:05:41

#8883: property_tree JSON reader does not parse unicode characters properly
----------------------------------+----------------------------------------
  Reporter:  Ronny Krueger <rk@…> |      Owner:  cornedbee
      Type:  Bugs                 |     Status:  new
 Milestone:  To Be Determined     |  Component:  property_tree
   Version:  Boost 1.54.0         |   Severity:  Problem
Resolution:                       |   Keywords:  property_tree JSON unicode
----------------------------------+----------------------------------------

Comment (by ecotax@…):

@Lettort: There is a difference between Unicode, which specifies that 'ä'
maps to code point U+00E4, and the various ways to encode this code point
in bits or bytes. There is UTF-16, encoding this as 00E4 (16 bits, fits in
a wide char), but also UTF-8, encoding this as two bytes, C3 A4.
When parsing a \u00E4, the correct way to handle this depends on what
encoding you want for your string.
If you have a wide string and expect UTF-16, then yes, you'd expect the
wide char 00E4.
If you have a regular string and expect UTF-8, you'd expect the two bytes
C3 A4.
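
To make the two encodings concrete, here is a minimal sketch in plain
standard C++ (not property_tree code; the names to_utf8 and to_utf16 are
made up for illustration, and the input is assumed to be a valid code
point that is not a surrogate):

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one code point as UTF-8 bytes.
// Assumes cp is a valid, non-surrogate code point (<= 0x10FFFF).
std::string to_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: encode one code point as UTF-16 code units.
std::u16string to_utf16(std::uint32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {                    // fits in a single 16-bit unit
        out += static_cast<char16_t>(cp);
    } else {                               // needs a surrogate pair
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 | (cp >> 10));
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
    }
    return out;
}
```

With these, to_utf8(0xE4) gives the two bytes C3 A4, while to_utf16(0xE4)
gives the single 16-bit unit 00E4.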

The original bug report states that when first writing and then reading
'ä', the writer (defensively?) writes this using two \u-encoded characters,
each being one byte of the UTF-8 encoding. Regardless of whether this is
the best choice, you'd want the reader to handle this in such a way that it
'round-trips' as much as possible, which currently is not the case.
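
As a rough illustration of what 'round-tripping' would mean here, the
following sketch mimics the per-byte writer behaviour described in the
report and pairs it with a reader that undoes it. This is not the
property_tree implementation; escape_bytes and unescape_bytes are
hypothetical names, and the reader assumes well-formed \uXXXX escapes and
ignores all other JSON escaping. Note also that, read strictly as JSON,
\u00C3 denotes code point U+00C3 rather than a raw byte, so a byte-wise
reader like this only makes sense as the counterpart of such a writer.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical writer: pass ASCII through, emit every other byte as its
// own \u00XX escape (the behaviour described in the bug report).
std::string escape_bytes(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            char buf[7];  // "\uXXXX" plus terminating NUL
            std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}

// Hypothetical reader: turn each \u00XX escape (value <= 0xFF) back into
// a single byte. Assumes the four characters after \u are hex digits.
std::string unescape_bytes(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 5 < in.size() && in[i + 1] == 'u') {
            unsigned long v = std::stoul(in.substr(i + 2, 4), nullptr, 16);
            if (v <= 0xFF) {
                out += static_cast<char>(v);
                i += 5;   // skip the rest of the 6-character escape
                continue;
            }
        }
        out += in[i];     // anything else is copied unchanged
    }
    return out;
}
```

For the UTF-8 string holding 'ä' (bytes C3 A4), escape_bytes produces
\u00C3\u00A4 and unescape_bytes turns that back into the same two bytes.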

BTW, for future questions/discussions, I guess a site like
stackoverflow.com is more appropriate.

-- 
Ticket URL: <https://svn.boost.org/trac/boost/ticket/8883#comment:3>
Boost C++ Libraries <http://www.boost.org/>
Boost provides free peer-reviewed portable C++ source libraries.

