From: Alberto Barbati (abarbati_at_[hidden])
Date: 2003-12-01 17:27:04
Hi Robert,
I had a look at the serialization library. Looks very good! I guess I am
going to adopt it right now! Good job.
I have a few remarks to make about the utf8_codecvt_facet. According to
the documentation (and the intent?) that class should help encode
Unicode text. In fact, what it does is (correctly) encode ISO/IEC
10646, which is often mistaken for Unicode but is a bit different.
The main difference is that Unicode imposes additional restrictions
and semantics on characters. The canonical encoding of Unicode is UTF-32
and not UCS-4, which is indeed the canonical encoding of ISO/IEC 10646.
Although the two encodings are essentially identical, UTF-32 allows only
scalar values up to 0x10ffff, while UCS-4 allows values up to
0x7fffffff. More importantly, when encoding Unicode, the following UTF-8
sequences are ill-formed and should be treated as errors if found in a
stream:
1) all sequences of more than 4 bytes
2) all "non-shortest" sequences (see
<http://www.unicode.org/versions/corrigendum1.html>)
3) all sequences that encode a surrogate (characters from U+D800 to U+DFFF)
4) all sequences encoding a non-character (for example U+FFFF)
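To make the four rules concrete, here is a minimal sketch of a strict
decoder for a single UTF-8 sequence. It is not the code of
utf8_codecvt_facet nor of the proposed UTF library; the names
utf8_status, utf8_result, is_noncharacter and decode_utf8 are
hypothetical and exist only for this illustration. Rejecting
non-characters (rule 4) is left as a switch, on the assumption that some
callers may want to let them through.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical names, for illustration only.
enum class utf8_status { ok, truncated, ill_formed };

struct utf8_result {
    utf8_status status;
    char32_t    code_point;  // meaningful only when status == ok
    std::size_t length;      // bytes consumed when status == ok
};

// U+FDD0..U+FDEF, plus the last two code points of every plane
// (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFF).
inline bool is_noncharacter(char32_t c) {
    return (c >= 0xFDD0 && c <= 0xFDEF) || (c & 0xFFFE) == 0xFFFE;
}

// Decode the sequence starting at s[pos], applying the four rules above.
inline utf8_result decode_utf8(const std::string& s, std::size_t pos = 0,
                               bool reject_noncharacters = true) {
    const utf8_result bad{utf8_status::ill_formed, 0, 0};
    if (pos >= s.size()) return {utf8_status::truncated, 0, 0};

    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t len;
    char32_t cp;
    char32_t min_cp;  // smallest value that really needs `len` bytes

    if (lead < 0x80)      { return {utf8_status::ok, lead, 1}; }
    else if (lead < 0xC0) { return bad; }  // stray continuation byte
    else if (lead < 0xE0) { len = 2; cp = lead & 0x1F; min_cp = 0x80; }
    else if (lead < 0xF0) { len = 3; cp = lead & 0x0F; min_cp = 0x800; }
    else if (lead < 0xF8) { len = 4; cp = lead & 0x07; min_cp = 0x10000; }
    else                  { return bad; }  // rule 1: more than 4 bytes

    for (std::size_t i = 1; i < len; ++i) {
        if (pos + i >= s.size()) return {utf8_status::truncated, 0, 0};
        const unsigned char b = static_cast<unsigned char>(s[pos + i]);
        if ((b & 0xC0) != 0x80) return bad;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }

    if (cp < min_cp) return bad;                      // rule 2: non-shortest form
    if (cp > 0x10FFFF) return bad;                    // outside the UTF-32 range
    if (cp >= 0xD800 && cp <= 0xDFFF) return bad;     // rule 3: surrogate
    if (reject_noncharacters && is_noncharacter(cp))  // rule 4: non-character
        return bad;
    return {utf8_status::ok, cp, len};
}
```

In a codecvt facet, one plausible mapping would be for do_in to return
std::codecvt_base::error on ill_formed and std::codecvt_base::partial on
truncated.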
Several months ago, I submitted a proposal for a UTF library. It should
still be in the boost file area. I received good feedback and also very
reasonable criticism that I should have taken into account, but I did
not have time to fix things and resubmit. However, the UTF-8 facet was
working well and I believe I could propose it as a drop-in replacement
for your facet.
By the way, I believe that the UTF facet should be in a library separate
from Serialization. What I could do if there is interest in this
direction is to re-submit the UTF library with the UTF-8 facet only, so
that it may be immediately available for the serialization library.
Being smaller and more focused than the entire UTF library, it may be
reviewed and hopefully approved quickly. Later on, I might
address the problems with the other facets and include them in a
subsequent release of the library.
What do you think?
Alberto Barbati