Boost :

Date view	Thread view	Subject view	Author view

From: Ian McCulloch (ianmcc_at_[hidden])
Date: 2005-11-25 11:46:10

Next message: John Maddock: "Re: [boost] Conforming tuples?"
Previous message: Robert Ramey: "Re: [boost] [serialization] fast array serialization (10x speedup)"
In reply to: Robert Ramey: "Re: [boost] [serialization] fast array serialization (10x speedup)"
Next in thread: David Abrahams: "Re: [boost] [serialization] fast array serialization (10x speedup)"
Reply: David Abrahams: "Re: [boost] [serialization] fast array serialization (10x speedup)"

Despite saying I would no longer participate in this thread, I re-read
Robert's recent list of queries and was struck by the thought that most, if
not all, is based on accidental mis-readings of some previous posts,
combined with some completely understandable misunderstandings about
performance issues that are well-known in the HPC domain (since clarified
elsewhere by some benchmarks). So, in the spirit of perhaps rescuing the
situation, I reply again:

Robert Ramey wrote:

> To summarize how we arrived here.
> =================================
>
> a) Mattias augmented binary_?archive to replace
> element by serialization of primitive types
> with save/load binary for C++ arrays, std::vectors
> and boost::val_array. This resulted in a 10 x speed
> up of the serialization process.

Right. If you see some function in the profile that is called 10,000 times
and is a bottleneck, which is better? Optimize that function a bit and get
a factor 2 speedup (see
http://lists.boost.org/Archives/boost/2005/11/97156.php), or change the
calling sequence so that that function is only called once, and get a
factor 10 speedup?

>
> b) From this it has been concluded that binary archives
> should be enhanced to provide this facility automatically
> and transparently to the user.

Right. From the point of view of a user, this is completely analagous to,
say, a standard library implementor optimizing std::copy(container.begin(),
container.end(), ostream_iterator<T>(stream)), for example. If he/she did
this, would the user be interested in *disabling* that optimization? What
would be the point?

But note, 'should' here is really 'could'. The current proposal from David
Abrahams explicitly does *not* modify any of the existing archives.

>
> c) The structure of the library an the documentation
> suggest that the convenient way to do this is to
> specify an overload for each combination of archive/type
> which can benefit from special treatment.
>

As I understand it (someone please correct me if I got this wrong), the core
idea of the new proposal (from
http://lists.boost.org/Archives/boost/2005/11/96923.php and followups) is
to provide a single point of customization, that *serialization function
authors* can utilize to serialize an array in one call. Of course making
use of this hook is optional, but since it is also a good convenience
function (it saves a couple of lines by avoiding having to manually code a
loop), there isn't really any point to not use it.

Note that a significant point of the proposal
http://lists.boost.org/Archives/boost/2005/11/96923.php is that the
possibility exists that this 'hook' is not part of the serialization
library proper, but the point is *it must be globally accessible* and not
specific to a particular archive.

>From http://lists.boost.org/Archives/boost/2005/11/96923.php :

David Abrahams wrote:
| In an upcoming message I'm planning to start by describing the least
| intrusive design that could possibly work -- one that makes no changes
| at all to the existing serialization library. It's not a bad design,
| but it has a few drawbacks that I'd like to discuss. At that point it
| should become clear what I meant about "hijacking" the serialization
| library. Finally, I'll describe the smallest set of changes to the
| existing serialization library that would be needed to address those
| drawbacks.
|
| Just to state the obvious, I hope to convince you that those few
| changes are worth making. Of course, as the maintainer of
| Boost.Serialization, it's completely up to you whether to do so.

Moving on to your next point:

> d) The above (c) is deemed inconvenient because it has
> been supposed that many archive classes will share
> a common implementation of load/save array. This would
> suggest that using (c) above, though simple and straight
> forward, will result in code repetition.

I think this must have resulted from a misunderstanding.

1. As far as I can tell, the proposal is completely consistent with (c)
above. The question of how the save/load_array is actually dispatched to
the archive is a detail that has relevance *only to archive implementors*.
Obviously different dispatch mechanisms have different tradeoffs, as
discussed elsewhere on this thread, but none of this should be visible for
people who are not archive authors.

2. I don't recall seeing anyone suggest that different archive types would
be able to share a common implementation of load/save array (except for the
trivial case where an archive has no array support and instead uses a
default implementation that serializes each element in a loop!). Can you
cite where you saw this, so that someone can clarify this?

3. Without a single 'hook' to use when writing serialization functions, the
only alternative is to specialize each serialization function that can make
use of array optimizations *separately for each archive*. For example, an
MPI archive might provide a helper function Save_MPI_Datatype(void* buffer,
size_t count, MPI_Datatype type), and require serialization function
authors' to call this member. An XDR archive might provide a function
SaveArray(T* Begin, T* End). With no cooperation between the authors of
the archive to use a common function for array serialization, the poor user
will have to write two sets of serialization functions, one that calls
Save_MPI_Datatype() and one that calls SaveArray(). I hope it is obvious
to you that this situation is completely untenable!

>
> e) So it has been proposed binary_iarchive be re-implemented
> in the following way
>
> iarchive - containg default implementation of load_array
> binary_iarchive - ? presumablu contains implementaion of load_array
> in terms of currently defined load_binary

This was part of the original (pre Nov 19) proposal and is no longer
relevant.

>
> Its not clear whether all archives would be modified in
> this way or just binary_iarchive.

? What specific implementation of load_array would you suggest, for the
other existing archive types?

>
> The idea is that each type which can benefit from load_array
> can call it and the version of load_array corresponding
> to that particular archive will be invoked. This will
> require
>
> i) the serialization functino for types which can benefit
> from some load_array function would call this.

Right. The set of types that can benefit is very dependent on the archive,
but in general it is anything that looks like a container.

> ii) Only a small number of load_array functions would have
> ot be written for each archive. So the number of special functions
> to be written would be One for each type which might use
> load_array and "one" for each archive.

Right.

>
> Problems with the Design
> ========================
>
> a) It doesn't address the root cause of "slow" performance
> of binary archives.
>
> The main problem is that it doesn't address the cause of the
> 10 X speed up. Its a classic case of premature optimization.
> The 10x speed up was based on test program. For a C++
> array, the test boils down to replacing 10,000 invocations
> for stream write(..) with one invocation of stream write
> 10,000 times longer.

I think this issue has been covered elsewhere, including benchmarks. If you
can see any remaining problems with the claim that the root cause of the
poor performance is the multiple calls to the buffer write(), and even
using a specialized buffer *does not significantly help*, then please say
so.

[snip]

> I would be surprised to see if the 10x speed up still exists
> with this "buffered_archive".

As shown elsewhere, you are surprised.

>
> note that for the intended application - mpi communication -
> some archive which doesn't use stream i/o have to be
> created anyway.

Right. In this case, the write() to the stream is replaced by either a
immediate call to MPI_Send(data, size, ...), or by a mechanism to construct
a derived MPI_Datatype that will essentially record a map of which memory
locations need to be sent and pass this directly to MPI. In both cases,
the cost of not utilizing array operations would not be 10x, but more like
1000x or more.

>
> b) re-implemenation of binary_archive in such a way so as not
> to break existing archives would be an error prone process.
> The switching between new and old method "should" result
> in exactly the same byte sequence. But it could easily
> occur that a small subtle change might render archives
> create under the previous binary_archive unreadable.

Again, this is not part of the revised proposal. But is the binary
serialization really *that* fragile, that this is a significant concern?
It suggests that binary archives using the boost serialization lib would be
really hard to write correctly (much harder than my experience suggests)!

>
> c) The premise that one will save a lot of coding
> (see d) above) compared to to the current method
> of overloading based on the pair of archive/type
> is overyly optimistic. This is explained in Peter Dimov's
> post here:
>
> http://lists.boost.org/Archives/boost/2005/11/97089.php
>

I think this is based on the same misunderstanding as the first point (d)
above? Anyway, what do you mean by "current method" ? Can you describe
how I should write my single serialization function (hopefully just one of
them!) for my_matrix_type using the "curent method" ?

> I'm aware this is speculative. I haven't investigated
> MPI, XDR and other's enough to know how much code
> sharing is possible. It does seem that there
> will be no sharing with the the "fast binary archive"
> of the previous submission. From the short descriptions
> of MPI I've seen on this list along with my cursory
> investigation of XDR, I'm doubtful that there is
> any sharing there either.

Again, I think this is based on the same misunderstanding. Archives will
typically not share implementations of array processing.

>
> Conclusions
> ===========
> a) The proposal suffers from "premature optimization".
> A large amount of design effort has been expended on
> areas which are likely not the source of observed
> performance bottlenecks.

Not true, as already explained here and elsewhere.

>
> b) The proposal suffers from "over generalizaton". The
> attempt to generalize results in a much more complex
> system. Such a system will result in a net loss of
> conceptual integregrity and implementation transparancey.
> The claim that this generalization will actually result in a
> reduction of code is not convincing.

I think this is based on the same misunderstanding as the first point (d)
above?

>
> c) by re-implementing a currently existing and used
> archive, it risks creating a maintainence headache
> for no real benefit.

Again, this is not part of the current proposal. This issue is not very
important because most of the interesting uses for array optimizations are
not based on archives currently in the serialization lib. Whether existing
archives make use of array optimizations is a side-issue that can be
discussed if/when the necessary array hooks actually exist.

>
> Suggestions
> ===========
>
> a) Do more work in finding the speed bottlenecks. Run
> a profiler. Make a buffer based non-stream based archive
> and re-run your tests.

Done, see
http://lists.boost.org/Archives/boost/2005/11/97166.php
http://lists.boost.org/Archives/boost/2005/11/97156.php

For the people here who have experience in this problem domain, the results
are completely obvious. In hindsight it is equally obvious that someone
from a different background would not be aware of this! Sorry.
>
> b) Make your MPI, XDR and whatever archives. Determine
> how much opportunity for code sharing is really
> available.

I think Matthias already has a prototype MPI as well as XDR archive? But
this point is again based on the misunderstanding of point (d) above. I
doubt there is any possibility for code-sharing between the load/save_array
functions between MPI and XDR *at all*, the point is that I, as a user,
want to be able to write just a single serialization function for
my_matrix_type that will make use of whatever array optimizations the
archive I am using can provide.

>
> c) If you still believe your proposal has merit, make
> your own "optimized binary archive". Don't derive
> from binary_archive but rather from common_?archive
> or perhaps basic_binary_archive. In this way you
> will have a totally free hand and won't have to
> achieve consensus with the rest of us which will
> save us all a huge amount of time.

I don't understand what this means.

Regards,
Ian

Next message: John Maddock: "Re: [boost] Conforming tuples?"
Previous message: Robert Ramey: "Re: [boost] [serialization] fast array serialization (10x speedup)"
In reply to: Robert Ramey: "Re: [boost] [serialization] fast array serialization (10x speedup)"
Next in thread: David Abrahams: "Re: [boost] [serialization] fast array serialization (10x speedup)"
Reply: David Abrahams: "Re: [boost] [serialization] fast array serialization (10x speedup)"

Date view	Thread view	Subject view	Author view

Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk