From: Kirit Sælensminde (kirit.saelensminde_at_[hidden])
Date: 2008-04-01 03:46:38
Robin Redeker wrote:
> On Thu, Mar 20, 2008 at 10:10:07PM +0100, Esteve Fernandez wrote:
>> - what about Unicode? I know that Boost.Regex supports Unicode if compiled
>> against ICU and the JSON spec states that everything must be in Unicode
>> (correct me if I'm wrong)
> Yes, the JSON spec states that JSON is Unicode text, encoded in (any)
> Unicode encoding (usually UTF-8). However, there is one hard part when
> writing a JSON parser: you have to take care to handle the \uXXXX
> literals in strings correctly. The JSON spec (RFC 4627,
> http://www.ietf.org/rfc/rfc4627.txt ) states in section 2.5:
> To escape an extended character that is not in the Basic
> Multilingual Plane, the character is represented as a
> twelve-character sequence, encoding the UTF-16 surrogate pair.
> So, for example, a string containing only the G clef character
> (U+1D11E) may be represented as "\uD834\uDD1E".
Sorry I'm a bit late to this - only just got around to reading this thread.
I have a JSON string parser that handles this correctly by parsing into
a UTF-16 buffer which can then be re-encoded to the required string type
and encoding. I described it to Thomas Jensen so he could use it if he
wanted to in TinyJSON, but anybody else should feel free to grab it if
it's useful too.
The parser is in the bottom half of this page:
On the same page are some notes about how I store JSON objects. Probably
not of interest for Boost.Serialization though. In any case I think ICU
may be able to provide some suitable string encoding functions that the
string parser could be parametrised on.
By coincidence I'd picked exactly the same test case as the standard.
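
For anyone who wants the shape of the approach without following the link, here is a minimal sketch of the two steps described above: decode the string body into a UTF-16 buffer, then re-encode that buffer into the target encoding (UTF-8 here). This is not the code from the page. The names parse_json_string_body, utf16_to_utf8 and hex_value are made up for illustration, it assumes a C++11 compiler for std::u16string, and to stay short it assumes every unescaped input byte is ASCII.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: numeric value of one hex digit.
inline unsigned hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    throw std::runtime_error("bad hex digit in \\u escape");
}

// Step 1: decode a JSON string body (no surrounding quotes) into UTF-16.
// Each \uXXXX escape becomes exactly one UTF-16 code unit, so a surrogate
// pair written as two escapes lands in the buffer as a valid pair.
// Assumption for brevity: unescaped bytes must be ASCII.
inline std::u16string parse_json_string_body(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char c = in[i];
        if (c != '\\') {
            if (static_cast<unsigned char>(c) >= 0x80)
                throw std::runtime_error("non-ASCII input not handled in this sketch");
            out.push_back(static_cast<char16_t>(c));
            continue;
        }
        if (++i >= in.size()) throw std::runtime_error("dangling backslash");
        switch (in[i]) {
        case '"':  out.push_back(u'"');  break;
        case '\\': out.push_back(u'\\'); break;
        case '/':  out.push_back(u'/');  break;
        case 'b':  out.push_back(u'\b'); break;
        case 'f':  out.push_back(u'\f'); break;
        case 'n':  out.push_back(u'\n'); break;
        case 'r':  out.push_back(u'\r'); break;
        case 't':  out.push_back(u'\t'); break;
        case 'u': {
            if (i + 4 >= in.size()) throw std::runtime_error("truncated \\u escape");
            unsigned unit = 0;
            for (int k = 1; k <= 4; ++k) unit = unit * 16 + hex_value(in[i + k]);
            out.push_back(static_cast<char16_t>(unit));
            i += 4;  // skip the four hex digits just consumed
            break;
        }
        default:
            throw std::runtime_error("unknown escape");
        }
    }
    return out;
}

// Step 2: re-encode the UTF-16 buffer as UTF-8, combining surrogate pairs.
inline std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate: needs a low one next
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::runtime_error("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::runtime_error("unpaired low surrogate");
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

With the G clef example from the RFC, parse_json_string_body("\\uD834\\uDD1E") gives the two code units 0xD834 0xDD1E, and utf16_to_utf8 turns those into the four bytes F0 9D 84 9E. Keeping UTF-16 as the intermediate form means the escape decoder never has to reason about pairs itself; only the re-encoding step does, and that step could be swapped for an ICU conversion or any other target encoding.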