From: David Abrahams (dave_at_[hidden])
Date: 2005-11-22 14:23:28
"Robert Ramey" <ramey_at_[hidden]> writes:
> David Abrahams wrote:
>> "Robert Ramey" <ramey_at_[hidden]> writes:
>>> I have one question about this.
>>> What is the ultimate purpose? Is it just to optimize
>>> serialization of certain types of collections of bit-streamable
>>> objects, or does it have some more ambitious goal?
>> I thought I highlighted the ultimate purpose quite clearly already:
>> | For many archive formats and common datatypes there exist APIs
>> | that can quickly read or write contiguous sequences of those types
>> | all at once (**). Reading or writing such a sequence by
>> | separately reading or writing each element (as the serialization
>> | library currently does) can be an order of magnitude more
>> | expensive.
>> We want to be able to capitalize on the existence of those APIs, and
>> to do that we need a "hook" that will be used whenever a contiguous
>> sequence is going to be (de)serialized. No such hook exists in
>> the serialization library today.
>> (**) Note that this capability is not necessarily tied to bitwise
>> serialization or the use of a binary representation.
>> In particular, I took special pains to clarify above (**) that this is
>> *not* merely about "serialization of certain types of collections of
>> bit-streamable objects."
>> If that's unclear, maybe you could ask some specific questions so that
>> I know what needs to be clarified.
> Could you give some other examples? Other than bit-serializable
> types, which can benefit from using binary read/write, none
> have occurred to me.
Well, it's not clear what exactly you mean by "bit serializable," but
I assume you're referring to types other than PODs.
Here's just one very trivial example: imagine a Unicode string library
that includes code to serialize and deserialize arrays of its strings. If
the library is separately compiled, merely crossing the boundary
between the Unicode library and serialization code in a loop will
incur the cost of a function call for each element. If you can call a
function in the Unicode library to serialize all the strings in an
array at once, it's a performance win. A data structure containing an
array of many short strings (short strings are very common, thus the
effectiveness of the short string optimization) would benefit from
avoiding that overhead. Furthermore, if you can serialize more
elements within a single function call, you can apply loop unrolling
for dramatic speedups: better than 2x in my tests. Your STL
implementation (if it's any good) does loop unrolling internally to
get this optimization.
> Another thing I'm wondering about is whether any work has been done
> to determine the source of the "10x speed up". For arrays of primitives,
> it would seem that the replacement of a loop of binary reads with
> one binary read of a larger block might explain it.
As mentioned above, that's part of the explanation in some cases. But
that's not the whole story. For example:
- In the case of binary serialization you can also save the cost of
repeated per-element comparisons in the stream buffer
implementation to make sure you're not overrunning the buffers.
- In the case of MPI the result of the optimizations is that MPI can
transfer complex data structures directly into the hardware's
communication buffers without making an additional copy in memory.
That's blazingly fast, as it can almost all happen in hardware.
Furthermore, doing anything else is actually not possible in our
application, because there's not enough memory for an in-memory
copy of the data structure.
> If that were the case, it might be most fruitful to invest effort in
> a different kind of i/o stream which only supports read/write but
> doesn't deal with all the operators, codecvt facets, etc.
Why do you think that would be most fruitful?
> In my personal work, I've found that iostream is very convenient
> - but it is a performance killer for binary i/o.
The use of iostreams for binary I/O is IMO a design mistake --
according to the experts, binary I/O should be done directly to
streambufs. But that's really irrelevant, as it isn't a
performance-limiting factor for us.
> Another possibility is a binary archive which doesn't depend upon
> i/o stream at all but rather fopen, fwrite, etc. In fact, in my own
> work, I've even found that too slow, so I had to replace it with my
> own version one step closer to the OS, which also exploited asio.h.
> This in turn entailed writing an asio implementation which wraps
> Windows async i/o API calls.
> My guess is that if I wanted to speed up serialization this would be a
> more effective direction.
Why do you think that would be _more_ effective? Did you achieve a
10x speedup by that approach?
> Another thing that I'm curious about is how much compilers can
> really collapse inline code when it's theoretically possible. In the
> case of an array of primitives, things should collapse to a loop of
> stream read calls without even calling anything inside the compiled
> library. I don't have any real knowledge as to which compilers - if
> any - actually do that.
They can all do that; it's just one level of inlining. If inlining
didn't do that, it would be almost pointless. That's one reason, for
example, that the STL can compete with or beat hand-written code. If
inlining couldn't collapse loops, Blitz++ wouldn't have stood a chance
at beating hand-written FORTRAN.
> I guess I could display the disassembly, and maybe it will come to
> that. But for now I don't think I have all the information I need to
> understand this.
We'll be happy to try and help you understand it. Just keep asking
questions.
--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk