
From: Matthias Troyer (troyer_at_[hidden])
Date: 2005-11-23 08:50:50


On Nov 23, 2005, at 10:53 AM, Peter Dimov wrote:

> Matthias Troyer wrote:
>
>> Indeed this sounds like a lot of work and that's why this mechanism
>> for message passing was rarely used in the past. The hard part is to
>> manually build up the custom MPI datatype, i.e. to inform MPI about
>> what the offsets and types of the various data members in a struct
>> are.
>>
>> This is where the serialization library fits in and makes the task
>> extraordinarily easy. Saving a data member with such an MPI archive
>> will register its address, type (as well as the number of identical
>> consecutive elements in an array) with the MPI library. Thus the
>> serialization library does all the hard work already!
>
> I still don't see the 10x speedup in the subject.

The 10x speedup was reported at the start of the thread, several weeks
ago, for benchmarks comparing the writing of large arrays and vectors
through the serialization library with writing them directly into a
stream or memory buffer. The MPI case was brought up to show that the
serialization library can also be used efficiently there if we have
save_array and load_array functionality. Please keep this in mind when
reading my replies. We are discussing MPI here as another example, next
to binary archives, where save_array and load_array optimizations will
be important.
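
To fix ideas, here is a minimal sketch of what a save_array optimization
means for a plain binary archive: one bulk write of a contiguous block
instead of one save call per element. The names (binary_buffer_oarchive,
save, save_array) are mine and not the interface actually being
proposed; this is only meant to illustrate the idea.

#include <cstddef>
#include <vector>

struct binary_buffer_oarchive   // hypothetical toy archive, for illustration only
{
    std::vector<char> buffer;

    // generic path: one call per element
    template <class T>
    void save(const T& t)
    {
        const char* p = reinterpret_cast<const char*>(&t);
        buffer.insert(buffer.end(), p, p + sizeof(T));
    }

    // bulk path: the whole contiguous block in a single call
    template <class T>
    void save_array(const T* t, std::size_t n)
    {
        const char* p = reinterpret_cast<const char*>(t);
        buffer.insert(buffer.end(), p, p + n * sizeof(T));
    }
};

The benchmarks mentioned above compare, in essence, the element-wise
path against the bulk path.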

> For an X[], the two approaches are:
>
> 1. for each x in X[], "serialize" into an MPI descriptor
> 2. serialize X[0] into an MPI descriptor, construct an array
> descriptor from it

Correct.

> Conceptual issues with (2) aside (the external format of X is
> determined by X itself and you have no idea whether the structure of
> X[0] also describes X[1]),

Of course you can use (2) only for contiguous arrays of the same
type, and not for types with pointer or polymorphic members. It will
work for any type that is layout-compatible with a POD and contains
no pointers or unions: examples are std::complex<T>, tuples of
fundamental types, or any struct having only fundamental types as
members. For such types the memory layout of X[0] is the same as that
of X[1]. The only cases where this does not hold are union or pointer
members, and there the optimization can, of course, not be applied.
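
To make (2) concrete in plain MPI terms, here is a rough sketch. The
type point and the functions make_point_type and send_points are my own
names, and this bypasses the archive layer entirely; it only shows the
mechanism the MPI archive would drive: describe the layout of a single
element once, then apply the resulting datatype to the whole contiguous
array with a single call.

#include <mpi.h>
#include <vector>

struct point { double x, y, z; };   // struct of fundamental types, no pointers/unions

MPI_Datatype make_point_type()
{
    point probe;                     // describe the layout of one element, X[0]
    MPI_Aint base, disp[3];
    MPI_Get_address(&probe,   &base);
    MPI_Get_address(&probe.x, &disp[0]);
    MPI_Get_address(&probe.y, &disp[1]);
    MPI_Get_address(&probe.z, &disp[2]);
    for (int i = 0; i < 3; ++i)
        disp[i] -= base;

    int          lengths[3] = { 1, 1, 1 };
    MPI_Datatype types[3]   = { MPI_DOUBLE, MPI_DOUBLE, MPI_DOUBLE };
    MPI_Datatype point_type;
    MPI_Type_create_struct(3, lengths, disp, types, &point_type);
    // (for a struct with trailing padding one would also resize the extent)
    MPI_Type_commit(&point_type);
    return point_type;
}

void send_points(std::vector<point>& v, int dest, MPI_Comm comm)
{
    MPI_Datatype t = make_point_type();
    // one call for the whole array; MPI reuses the layout of X[0] for every element
    MPI_Send(&v[0], static_cast<int>(v.size()), t, dest, 0, comm);
    MPI_Type_free(&t);
}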

> I'm not sure that there will be such a major speedup compared to the
> naive (1).

Oh yes, there can be a huge difference. Let me just give a few reasons:

1) In the applications we are talking about we regularly have to send
huge contiguous arrays of numbers (stored e.g. in a matrix, vector,
valarray or multi_array) over the network. The typical size is 100
million numbers and upwards; I'll stick to 100 million as a typical
number in the following. Storing these 100 million numbers already
takes up 800 MByte and nearly fills the memory of the machine, and
this causes problems:

   a) copying these numbers into a buffer using the serialization
library needs another 800 MByte of memory that might not be available

   b) creating an MPI datatype entry for each member separately means
storing at least 12 bytes per element (4 bytes each for the address,
type and count), for a total of 1200 MByte instead of just 12 bytes.
Again we will have a memory problem

But the main issue is speed. Serializing 100 million numbers one by
one requires 100 million accesses to the network interface, while
serializing the whole block at once causes just a single call, and
the rest is done by the hardware. The reason we cannot afford this
overhead is that on modern high performance networks

   ** the network bandwidth is the same as the memory bandwidth **

and that, even if everything could be perfectly inlined and
optimized, the time to read the MPI datatype for each element when
using (1) will completely overwhelm the time actually required to
send the message using (2).
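
In the crudest possible terms (leaving the archive and datatype
machinery aside, and with buf a plain double array of length n), the
difference is between

// naive per-element traffic: n separate calls into the message-passing layer
for (std::size_t i = 0; i < n; ++i)
    MPI_Send(&buf[i], 1, MPI_DOUBLE, dest, 0, comm);

and

// block transfer: a single call, the interconnect streams the whole buffer
MPI_Send(buf, static_cast<int>(n), MPI_DOUBLE, dest, 0, comm);

The second form is the kind of call that the save_array/load_array
optimizations are meant to enable.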

To substantiate my claim (**) above, I want to mention a few numbers:

  * the "Black Widow" network of the Cray X1 series has a network
bandwidth of 55 GByte/second!
   * the "Red Storm" network of the Cray XT3 Opteron clusters, uses
one hypertransport channel for the network acces, and another one for
memory access, and thus the bandwidth here is the same as the memory
bandwidth
   * the IBM Blue Gene/L has a similarly fast network with 4.2 GByte/
second network bandwidth per node
   * even going to cheaper commodity hardware, like Quadrics
interconnects, 1 GByte/second is common nowadays.

I am sure you will understand that to keep up with these network
data transfer rates we cannot afford to perform additional
operations, such as accessing the network interface once per double
to read the address, quite apart from the memory issue raised above.

I hope this clarifies why approach (2) should be taken whenever
possible.

> Robert's point also deserves attention; a portable binary archive
> that writes directly into a socket eliminates the MPI middleman and
> will probably achieve a similar performance as your two-pass MPI
> approach.

This is indeed a nice idea and would remove the need for MPI in some
applications on standard-speed TCP/IP based networks, such as
Ethernet, but it is not a general solution for a number of reasons:

1. Sockets do not even exist on most dedicated network hardware, but
MPI is still available, since it is the standard API for message
passing on parallel computers. Even where sockets are available,
they just add additional layers of (expensive) function calls
between the network hardware and the serialization library, while
vendor-provided MPI implementations usually access the network
hardware directly.

2. MPI is much more than a point-to-point communication protocol
built on top of sockets. It is actually a standardized API for all
high performance network hardware. In addition to point-to-point
communication (synchronous, asynchronous, buffered and one-sided)
it also provides a large number of global operations, such as
broadcasts, reductions, gather and scatter. These work with log(N)
complexity on N nodes and often use special network hardware
dedicated to the task (such as on an IBM Blue Gene/L machine). All
these operations can take advantage of the MPI datatype mechanism.
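
As a hedged continuation of the sketch given earlier (reusing its
point type and make_point_type helper), the same derived datatype can
be handed to a collective unchanged; broadcast stands in here for any
of the operations listed above:

void broadcast_points(std::vector<point>& v, int root, MPI_Comm comm)
{
    MPI_Datatype t = make_point_type();   // the datatype built once from X[0]
    MPI_Bcast(&v[0], static_cast<int>(v.size()), t, root, comm);
    MPI_Type_free(&t);
}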

3. An MPI implementation can determine at runtime whether the
transformation to a portable binary representation is actually needed
or whether the raw bits can just be streamed, and it does this
transparently, hiding it from the user.

> It also
> supports versioned non-PODs and other nontrivial types. As an
> example, I
> have a type which is saved as
>
> save( x ):
>
> save( x.anim.name() ); // std::string
>
> and loaded as
>
> load( x ):
>
> string tmp;
> load( tmp );
> x.set_animation( tmp );
>
> Not everything is field-based.

Indeed, and for such types the optimization would not apply.

