
From: Alberto Barbati (abarbati_at_[hidden])
Date: 2002-11-17 16:49:23


Robert Ramey wrote:
> 1) changing the state of the stream while serializing. My implementation initialized the stream
> and never contemplated that the same stream might be used for other things. That is that
> serialized data might be "embedded" as part of a larger stream.
>
> Apparently this is an issue for some people. I don't see it as a large issue, but
> it is easy to address.

In fact the issue is so easy to address that I don't understand why we
are still discussing it :) If you are willing to accept my solution,
please say so immediately, so we won't waste any more time.

> One method of storing/recovering the data is to use a sequence of characters
> or wide characters. That is a C++ stream.
>
> This has some major benefits:
>
> a) All the code required to convert any C++ datatype into characters or wide
> characters exists and is part of the standard library and is guaranteed to work.

This is not true, and I proved it to you with a code snippet in a recent
post of mine. The standard *does not* provide a way to output (i.e., to
write to a disk file) a stream of wide characters. You can put wide
characters into a wide stream, but you will always obtain a file of
"narrow" characters, produced through a "degenerate conversion" as
explicitly specified in the standard.

Moreover, I have very bad news. I just found that the C++ implementation
shipped with .NET is not conformant on this point. Consider the
following program:

#include <fstream>

int main()
{
     std::wofstream out("test.txt", std::ios::binary);
     out << L"I owe you \x20ac 1\n"; // \x20ac is the Euro sign
     return 0;
}

On .NET with STLport you get the incorrect, but ANSI-conforming, result:

"I owe you ¬ 1"

'¬' being the character with code 0xAC (the low byte of 0x20AC). On .NET
with its native STL implementation you get

"I owe you "

The program chokes when writing the Euro sign and leaves the stream in a
"failed" state :( Here Microsoft seems to have really screwed something up.

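Just to sketch the kind of fix I have in mind: imbue the stream with a
codecvt facet that performs a real wide-to-multibyte conversion before
anything is written. The std::codecvt_utf8 facet below is an assumption
on my part (it comes from a later revision of the standard library);
with today's implementations you would have to write an equivalent
facet yourself:

#include <fstream>
#include <locale>
#include <codecvt> // assumption: a later-standard facet, not available today

int main()
{
     std::wofstream out("test.txt", std::ios::binary);
     // replace the default "degenerate" codecvt with one that really
     // converts wide characters to a multibyte encoding (UTF-8 here)
     out.imbue(std::locale(out.getloc(), new std::codecvt_utf8<wchar_t>));
     out << L"I owe you \x20ac 1\n"; // the Euro sign comes out as 0xE2 0x82 0xAC
     return 0;
}
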
>>>Another observation:
>>>
>>>I note that my test.cpp program includes wchar_t member variables initialized
>>>to values in excess of 256.
>>>The system doesn't seem to lose any information in storing/loading to a stream
>>> with the classic locale.
>>
>
> I double checked.
>
> I have functions in both char and wchar_t versions of text archives to handle both strings
> of chars and wstrings. This created a couple of problems. The most obvious was what about
> strings containing embedded blanks - and other punctuation. Single characters such as a space
> were also a problem. First I implemented them as a sequence of short integers. That worked
> fine but I was concerned that it wasted space, was slow, and inconvenient for debugging.
> So I made special functions for i/o of string and wstring which just write a string length
> and then stream out the string buffer as binary.
>
> So I never have the problem that Unicode or locale or anything else interferes with my serialization.
> This is a side effect of the fact that the usage of the stream was carefully limited to the purpose at hand.

You should triple check, then. Following my previous example, this program:

#include <cassert>
#include <fstream>
#include <string>
// (serialization library headers omitted)

int main()
{
     std::wstring outs(L"I owe you \x20ac 1"), ins;

     {
         std::wofstream out("test.txt", std::ios::binary);
         boost::woarchive ar(out);
         ar << outs;
     }

     {
         std::wifstream in("test.txt", std::ios::binary);
         boost::wiarchive ar(in);
         ar >> ins;
     }

     assert(outs == ins);
     return 0;
}

fails on at least two platforms (.NET/native STL and .NET/STLport), in
two different ways.

> Of course this raises the question: why support wstreams at all? We're not using their advantages
> (unless we have a lot of Unicode text to store) and it doubles the required space.

Let's replace wide streams and archives with narrow ones in the previous
example. The program does indeed run successfully on both STLport and
the .NET native STL, but let's have a look at the archive file:

---begin file
22 serialization::archive 1
0 1 13 73 32 111 119 101 32 121 111 117 32 8364 32 49
---end file

This alternative requires from 2 to 6 (six!) bytes per Unicode
character, because each code unit is written as its decimal value
followed by a separating space. That goes up to 12 bytes for characters
encoded as surrogate pairs, dropping back to 8 if your wchar_t is 32
bits wide (:o another platform-specific issue has leaked in!). If I had
lots of Unicode strings I would have no doubt about which solution is
better.
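
For reference, the narrow variant I mean is just the round trip above
with narrow streams and archives; boost::oarchive and boost::iarchive
are assumed here to be the narrow counterparts of the classes used
earlier:

#include <cassert>
#include <fstream>
#include <string>
// (serialization library headers omitted)

int main()
{
     std::wstring outs(L"I owe you \x20ac 1"), ins;

     {
         std::ofstream out("test.txt", std::ios::binary);
         boost::oarchive ar(out); // assumed narrow counterpart of boost::woarchive
         ar << outs;              // each wchar_t is written as a decimal number
     }

     {
         std::ifstream in("test.txt", std::ios::binary);
         boost::iarchive ar(in);  // assumed narrow counterpart of boost::wiarchive
         ar >> ins;
     }

     assert(outs == ins);
     return 0;
}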

I hope you realize that Unicode output is a lot more complex than it
seems. I am just asking you to allow the programmer to avoid overriding
the locale, which can still be the default option. Am I asking too much?
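
Purely as an illustration of the interface I am asking for (the
boost::no_codecvt flag below is hypothetical, not something the library
currently offers): the archive constructor could accept an option
telling it to leave the stream's locale alone, so imbuing stays the
default but is no longer forced:

#include <fstream>
#include <locale>
#include <codecvt> // assumption, as in the earlier sketch
#include <string>
// (serialization library headers omitted)

int main()
{
     std::wofstream out("test.txt", std::ios::binary);
     // the programmer picks the conversion...
     out.imbue(std::locale(out.getloc(), new std::codecvt_utf8<wchar_t>));
     // ...and a hypothetical flag tells the archive not to imbue its own locale
     boost::woarchive ar(out, boost::no_codecvt);
     ar << std::wstring(L"I owe you \x20ac 1");
     return 0;
}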

Alberto Barbati

