From: Robert Ramey (ramey_at_[hidden])
Date: 2005-11-25 12:19:04

Matthias Troyer wrote:
>> Hi Robert,
>> I'll let Dave comment on the parts where you review his proposal, and
>> will focus on the performance.
>> On Nov 24, 2005, at 6:59 PM, Robert Ramey wrote:
>>> a) It doesn't address the root cause of "slow" performance
>>> of binary archives.
>> I have done the benchmarks you desired last night (see below), and
>> they indeed show that the root cause of slow performance is the
>> individual writing of many small elements instead of "block-writing"
>> of the array in a call to something like save_array.
>>> b) re-implementation of binary_archive in such a way as not
>>> to break existing archives would be an error-prone process.
>>> Switching between the new and old methods "should" result
>>> in exactly the same byte sequence. But it could easily
>>> occur that a small subtle change might render archives
>>> created under the previous binary_archive unreadable.
>> Dave's design does not change anything in your archives or
>> serialization functions, but only adds an additional binary archive
>> using save_array and load_array.

Hmm - that's not the way I read it. I've touched on this in another post.
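Point (b) turns on whether a block-writing archive can produce exactly the same byte sequence as element-by-element writing. Below is a minimal sketch, assuming a raw-memory buffer archive. The names binary_oarchive, binary_iarchive and array::binary_oarchive are hypothetical stand-ins, not the Boost classes. It shows a new archive in its own namespace that leaves the existing class untouched and only adds a save_array hook. For trivially copyable element types stored contiguously, one bulk copy yields the same bytes as a loop of per-element copies, so an unchanged reader can still load the result. Like any native binary format, it assumes writer and reader share type sizes and byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for an existing binary archive: writes one element per call.
class binary_oarchive {
public:
    template <class T>
    void save(const T& t) {
        static_assert(std::is_trivially_copyable<T>::value, "raw bytes only");
        const char* p = reinterpret_cast<const char*>(&t);
        buffer_.insert(buffer_.end(), p, p + sizeof(T));
    }
    const std::vector<char>& buffer() const { return buffer_; }

protected:
    std::vector<char> buffer_;
};

namespace array {
// Hypothetical new archive: same byte format, plus a block-write hook.
// The existing class above is not modified.
class binary_oarchive : public ::binary_oarchive {
public:
    template <class T>
    void save_array(const T* p, std::size_t n) {
        static_assert(std::is_trivially_copyable<T>::value, "raw bytes only");
        const char* b = reinterpret_cast<const char*>(p);
        buffer_.insert(buffer_.end(), b, b + n * sizeof(T));  // one bulk copy
    }
};
}  // namespace array

// Hypothetical stand-in for the existing reader: loads one element per call.
// No bounds checking in this sketch.
class binary_iarchive {
public:
    explicit binary_iarchive(const std::vector<char>& buf) : buf_(buf) {}
    template <class T>
    void load(T& t) {
        std::memcpy(&t, buf_.data() + pos_, sizeof(T));
        pos_ += sizeof(T);
    }

private:
    const std::vector<char>& buf_;
    std::size_t pos_ = 0;
};
```

Writing a std::vector<double> with a loop of save() calls on the old class and with one save_array() call on the new one gives identical buffers, and the old reader loads either.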

>>> c) The premise that one will save a lot of coding
>>> (see d) above) compared to the current method
>>> of overloading based on the pair of archive/type
>>> is overly optimistic.
>> Actually I have implemented two new archive classes (MPI and XDR)
>> which can profit from it, and it does save lots of code duplication.
>> All of the serialization functions for types that can make use of
>> such an optimization can be shared between all these archive types.
>> In addition formats such as HDF5 and netCDF have been mentioned,
>> which can reuse the *same* serialization function to achieve optimal
>> performance.
>> There is nothing "optimistic" here since we have the actual
>> implementations, which show that code duplication can be avoided.

OK - I can really only comment on that which I've seen.
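Matthias's point about sharing serialization functions between the fast binary, MPI and XDR archives can be illustrated with a single generic save function. If the archive provides a save_array member, the function makes one bulk call. Otherwise it falls back to a per-element loop. This is a sketch under assumptions, not the proposal's actual interface. save_vector, has_save_array, plain_archive and array_archive are hypothetical names, the two archives merely count calls, and the detection idiom requires C++17 (std::void_t, if constexpr).

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Detects whether Archive has a member save_array(const T*, std::size_t).
template <class Archive, class T, class = void>
struct has_save_array : std::false_type {};

template <class Archive, class T>
struct has_save_array<
    Archive, T,
    std::void_t<decltype(std::declval<Archive&>().save_array(
        std::declval<const T*>(), std::declval<std::size_t>()))>>
    : std::true_type {};

// Written once and shared by every archive type (hypothetical).
template <class Archive, class T>
void save_vector(Archive& ar, const std::vector<T>& v) {
    ar.save(v.size());
    if constexpr (has_save_array<Archive, T>::value &&
                  std::is_trivially_copyable<T>::value) {
        ar.save_array(v.data(), v.size());  // one bulk call
    } else {
        for (const T& x : v) ar.save(x);    // one call per element
    }
}

// Hypothetical archive without the hook; it only counts calls.
struct plain_archive {
    std::size_t calls = 0;
    template <class T> void save(const T&) { ++calls; }
};

// Hypothetical archive with the hook (a stand-in for MPI, XDR, fast binary, ...).
struct array_archive {
    std::size_t calls = 0;
    template <class T> void save(const T&) { ++calls; }
    template <class T> void save_array(const T*, std::size_t) { ++calls; }
};
```

Saving a 1000-element vector costs 1001 calls on plain_archive and 2 on array_archive. A further archive, for instance one wrapping an HDF5 or netCDF array routine, would only need its own save_array member. save_vector itself would stay unchanged.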

>>> Conclusions
>>> ===========
>>> a) The proposal suffers from "premature optimization".
>>> A large amount of design effort has been expended on
>>> areas which are likely not the source of observed
>>> performance bottlenecks.
>> As Dave pointed out, one main reason for a save_array/load_array or
>> save_sequence/load_sequence hook is to utilize existing APIs for
>> serialization (including message passing) that provide optimized
>> functions for arrays of contiguous data. Examples include MPI, PVM,
>> XDR and HDF5. There is a well-established reason why all these libraries
>> have special functions for arrays of contiguous data: they
>> all observed the same bottlenecks. These bottlenecks have been well known
>> for decades in high-performance computing, and have caused all these
>> APIs to include special support for contiguous arrays of data.

I admit I'm skeptical of the benefits, but I've not disputed that someone
should be able to do this without a problem. The difference lies in where
the implementation should be placed.

>>> b) The proposal suffers from "over generalization". The
>>> attempt to generalize results in a much more complex
>>> system. Such a system will result in a net loss of
>>> conceptual integrity and implementation transparency.
>>> The claim that this generalization will actually result in a
>>> reduction of code is not convincing.
>> I'm confused by your statement. Actually the implementations of fast
>> binary archives, MPI archives and XDR archives do share common
>> serialization functions, and this does indeed result in code
>> reduction and avoids code duplication.

Upon reflection, I think I would prefer the term "premature generalization".
I concede that's speculation on my part. It seems a lot of effort has
been invested to avoid the MxN problem. My own experiments with
bitwise_array_archive_adaptor have failed to convince me that the
library needs more API to deal with this problem. Shortly, I will
be uploading some code which perhaps will make my reasons
for this belief more obvious.
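The MxN problem mentioned above can be stated as a simple count. Suppose, purely for illustration, that there are M archive classes and N types whose layout permits block I/O. Overloading on each archive/type pair needs one function per pair. A save_array-style hook needs one hook per archive plus one serialization function per type. The count below covers only the save side; the load side doubles both figures. The values M = 4 and N = 3 are hypothetical.

```latex
\[
  W_{\text{pairs}} = M \times N,
  \qquad
  W_{\text{hook}} = M + N .
\]
% Hypothetical example: M = 4 archives, N = 3 types
\[
  W_{\text{pairs}} = 4 \times 3 = 12,
  \qquad
  W_{\text{hook}} = 4 + 3 = 7 .
\]
% The hook is strictly cheaper exactly when
\[
  M N > M + N \iff (M-1)(N-1) > 1 .
\]
```

For M = 1 or N = 1 the hook saves nothing. It only pays off once both counts exceed one and at least one exceeds two.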

>>> c) by re-implementing a currently existing and used
>>> archive, it risks creating a maintenance headache
>>> for no real benefit.
>> To avoid any such potential problems Dave proposed to add a new
>> archive in an array sub namespace.

As I said - that's not how I understood it.

>> I guess that alleviates your
>> concerns? Also, a 10x speedup might not be a benefit for you and your
>> applications but as you can see from postings here, it is a concern
>> for many others.

LOL - No one has ever disputed the utility of a 10x speed up. The question
is how best to achieve it without creating a ripple of side effects.

>>> Suggestions
>>> ===========
>>> a) Do more work in finding the speed bottlenecks. Run
>>> a profiler. Make a buffer-based, non-stream-based archive
>>> and re-run your tests.
>> I have attached a benchmark for such an archive class and ran
>> benchmarks for std::vector<char> serialization. Here are the numbers
>> (using gcc-4 on a PowerBook G4):
>> Time using serialization library: 13.37
>> Time using direct calls to save in a loop: 13.12
>> Time using direct call to save_array: 0.4

>> In this case the buffer had size 0 at first and needed to be resized
>> during the insertions. Here are numbers for the case where enough
>> memory has been reserved with reserve():
>> Time using serialization library: 12.61
>> Time using direct calls to save in a loop: 12.31
>> Time using direct call to save_array: 0.35
>> And here are the numbers for std::vector<double>, using a vector of
>> 1/8-th the size:
>> Time using serialization library: 1.95
>> Time using direct calls to save in a loop: 1.93
>> Time using direct call to save_array: 0.37
>> Since there are fewer calls for these larger types it looks slightly
>> better, but even now there is a more than 5x difference in this
>> benchmark.
>> As you can see the overhead of the serialization library (less than
>> 2%) is insignificant compared to the cost of doing lots of individual
>> insertion operations into the buffer instead of one big one. The
>> bottleneck is thus clearly the many calls to save() instead of a
>> single call to save_array().

Well, this is interesting data. The call to save() resolves inline to
a call to std::vector's element access and stuffing the value into the buffer.
I wonder how much of this is in std::vector and how much is in
the save to the buffer.

And it does diminish my skepticism about how much benefit
array serialization would provide, at least in these specific cases.

So, I'll concede that this will be a useful facility for a significant
group of users. Now we can focus on how to implement it
with the minimal collateral damage.
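Robert's question about how much time goes into std::vector element access versus the buffer insertion could be probed with a small harness along these lines. It is a hypothetical sketch, not the benchmark Matthias attached. The function names are illustrative, the buffer is a plain std::vector<char>, and the reserve flag mirrors the resized-versus-reserved runs quoted above. No timings are claimed here, since they depend heavily on compiler and optimization level.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Element access alone: isolates the std::vector part of the cost.
// Returning the sum keeps an optimizing compiler from discarding the loop.
unsigned long sum_elements(const std::vector<char>& v) {
    unsigned long s = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        s += static_cast<unsigned char>(v[i]);
    return s;
}

// One small insertion per element, mimicking repeated save() calls.
std::vector<char> write_elementwise(const std::vector<char>& v, bool reserve) {
    std::vector<char> buf;
    if (reserve) buf.reserve(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) {
        const char c = v[i];
        buf.insert(buf.end(), &c, &c + 1);
    }
    return buf;
}

// A single insertion of the whole contiguous range, mimicking save_array().
std::vector<char> write_bulk(const std::vector<char>& v, bool reserve) {
    std::vector<char> buf;
    if (reserve) buf.reserve(v.size());
    buf.insert(buf.end(), v.begin(), v.end());
    return buf;
}

// Wall-clock seconds taken by f().
template <class F>
double seconds(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Timing sum_elements against write_elementwise separates the two costs Robert asks about. Timing write_elementwise against write_bulk corresponds roughly to the "save in a loop" versus "save_array" rows above. Build with optimization enabled, because unoptimized builds exaggerate per-call overhead.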

>>> b) Make your MPI, XDR and whatever archives. Determine
>>> how much opportunity for code sharing is really
>>> available.
>> This has been done and is the reason for the proposal to introduce
>> something like the save_array/load_array functions. I have coded an
>> XDR and two different types of MPI archives (one using a buffer,
>> another not using a buffer). A single serialization function for
>> std::valarray, using the load_array hook, suffices to use the optimized
>> APIs in MPI and XDR, as well as a faster binary archive, and the same is
>> true for other types.
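The following sketches the kind of shared function Matthias describes, for the load side of std::valarray. It reads the element count, resizes, then hands the contiguous storage to a single load_array call. The buffer_iarchive class and load_valarray function are hypothetical and written only for this illustration. An MPI or XDR archive would presumably implement load_array with its own array routine instead of std::memcpy. The sketch relies on std::valarray storing its elements contiguously, which the C++ standard guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <valarray>
#include <vector>

// Hypothetical input archive over a raw byte buffer, with a block-read hook.
class buffer_iarchive {
public:
    explicit buffer_iarchive(const std::vector<char>& buf) : buf_(buf) {}

    // Single values go through the same path as arrays of length 1.
    template <class T>
    void load(T& t) { load_array(&t, 1); }

    template <class T>
    void load_array(T* p, std::size_t n) {
        const std::size_t bytes = n * sizeof(T);
        if (bytes > buf_.size() - pos_)
            throw std::out_of_range("buffer_iarchive: not enough data");
        std::memcpy(p, buf_.data() + pos_, bytes);
        pos_ += bytes;
    }

private:
    const std::vector<char>& buf_;
    std::size_t pos_ = 0;
};

// One load function for std::valarray, usable by any archive with load/load_array.
template <class Archive, class T>
void load_valarray(Archive& ar, std::valarray<T>& v) {
    std::size_t n = 0;
    ar.load(n);
    v.resize(n);
    if (n != 0) ar.load_array(&v[0], n);  // elements are contiguous
}
```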

>>> c) If you still believe your proposal has merit, make
>>> your own "optimized binary archive". Don't derive
>>> from binary_archive but rather from common_?archive
>>> or perhaps basic_binary_archive. In this way you
>>> will have a totally free hand and won't have to
>>> achieve consensus with the rest of us which will
>>> save us all a huge amount of time.
>> I'm confused. I realize that one should not derive from
>> binary_iarchive, but why should one not derive from
>> binary_iarchive_impl?

What I meant is if you don't change the current binary_i/oarchive
implementation you won't have to worry about backward compatibility
with any existing archives. I (mis?)understood the proposal to include
adjustments to the current implementation so that it could be
derived from.

>> Also, following Dave's proposal none of your archives is touched, but
>> instead additional faster ones are provided.

This wasn't clear to me from my reading of the proposal.

Robert Ramey
