From: Matthias Troyer (troyer_at_[hidden])
Date: 2005-11-23 09:51:46


On Nov 23, 2005, at 3:11 PM, Peter Dimov wrote:

> Matthias Troyer wrote:
>
>> Oh yes, there can be a huge difference. Let me just give a few
>> reasons:
>>
>> 1) in the applications we are talking about we have to regularly send
>> huge contiguous arrays of numbers (stored e.g. in a matrix, vector,
>> valarray or multi_array) over the network. The typical size is 100
>> million numbers upwards; I'll stick to 100 million as a typical
>> number in the following. Storing these 100 million numbers already
>> takes up 800 MBytes, which nearly fills the memory of the machine and
>> causes problems:
>>
>> a) copying these numbers into a buffer using the serialization
>> library needs another 800 MB of memory that might not be available
>>
>> b) creating MPI data types for each member separately means storing
>> at least 12 bytes (4 bytes each for the address, type and count), for
>> a total of 1200 MBytes instead of just 12 bytes. Again we will have
>> a memory problem
>>
>> But the main issue is speed. Serializing 100 million numbers one by
>> one requires 100 million accesses to the network interface, while
>> serializing the whole block at once requires just a single call, and
>> the rest is done by the hardware. The reason we cannot afford this
>> overhead is that on modern high-performance networks
>>
>> ** the network bandwidth is the same as the memory bandwidth **
>
> This makes sense, thank you. I just want to note that contiguous
> arrays of double are handled equally well by either approach under
> discussion; an mpi_archive will obviously include an overload for
> double[].

Yes, but only if you have some save_array or save_sequence hook, or
alternatively if the archive specifically provides overloads for
double[], std::vector<double>, std::valarray<double>,
boost::multi_array<double,N>, ...
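Just to make concrete what I mean, a rough sketch (the mpi_oarchive
class and its save_array hook below are hypothetical, purely for
illustration; they are not existing code):

// Hypothetical sketch only: neither this mpi_oarchive nor save_array
// exists; it just illustrates sending a contiguous block of doubles
// with a single MPI call instead of element by element.
#include <mpi.h>
#include <vector>
#include <cstddef>

class mpi_oarchive
{
public:
    mpi_oarchive(MPI_Comm comm, int dest, int tag)
      : comm_(comm), dest_(dest), tag_(tag) {}

    // generic path: serialize element by element (not shown)
    template <class T> void save(const T& t);

    // optimized path: the whole contiguous array goes out in one call,
    // with no per-element traversal and no intermediate buffer
    void save_array(const double* p, std::size_t n)
    {
        MPI_Send(const_cast<double*>(p), static_cast<int>(n),
                 MPI_DOUBLE, dest_, tag_, comm_);
    }

private:
    MPI_Comm comm_;
    int dest_;
    int tag_;
};

// a container like std::vector<double> can then dispatch to the fast path
inline void save(mpi_oarchive& ar, const std::vector<double>& v)
{
    if (!v.empty())
        ar.save_array(&v[0], v.size());
}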

> I was interested in the POD case. A large array of 3x3 matrices
> wrapped in matrix3x3 structs would probably be a good example that
> illustrates your point (c) above.

Indeed, the 3x3 matrix struct is a good example of why we want to use
this mechanism for more than just a fixed number of fundamental types.
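For instance (only a sketch, with a made-up matrix3x3 struct), a
derived MPI datatype lets us send an array of N such structs with a
single call:

// Sketch only: matrix3x3 is a stand-in POD type for this discussion.
#include <mpi.h>

struct matrix3x3
{
    double data[9];   // 3x3 matrix stored contiguously
};

// Describe one matrix3x3 to MPI once; an array of N structs can then be
// sent with a single MPI_Send instead of 9*N separate doubles or 12 bytes
// of address/type/count bookkeeping per element.
inline MPI_Datatype make_matrix3x3_type()
{
    MPI_Datatype t;
    MPI_Type_contiguous(9, MPI_DOUBLE, &t);
    MPI_Type_commit(&t);
    return t;
}

void send_matrices(const matrix3x3* m, int n, int dest, int tag, MPI_Comm comm)
{
    MPI_Datatype t = make_matrix3x3_type();
    MPI_Send(const_cast<matrix3x3*>(m), n, t, dest, tag, comm);
    MPI_Type_free(&t);
}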

> (a) and (b) can be avoided by issuing multiple MPI_Send
> calls for non-optimized sequence writes.

Yes, but that will hurt performance. The latency of a single MPI_Send
is still typically on the order of 0.5-5 microseconds even on the
fastest machines. You are right, though, that if we cannot use the
fast mechanism and run into memory problems, then we will indeed need
to split the message.
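By splitting I mean something along these lines (only a sketch; the
chunk size is arbitrary):

// Sketch of the fallback: send a huge array in fixed-size pieces so no
// single 800 MB staging buffer is ever needed. The chunk size is arbitrary.
#include <mpi.h>
#include <cstddef>

void send_in_chunks(const double* p, std::size_t n,
                    int dest, int tag, MPI_Comm comm)
{
    const std::size_t chunk = 1 << 20;   // 1M doubles = 8 MB per message
    for (std::size_t i = 0; i < n; i += chunk)
    {
        std::size_t len = (n - i < chunk) ? (n - i) : chunk;
        // every call pays the 0.5-5 microsecond message latency, which is
        // why the single-call path is preferable whenever it is available
        MPI_Send(const_cast<double*>(p + i), static_cast<int>(len),
                 MPI_DOUBLE, dest, tag, comm);
    }
}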

In the case of non-optimized sequence writes, though, I might not use
the MPI data type mechanism, but instead pack the message into a
buffer and send that buffer. For that one could either use MPI's
MPI_Pack functions or prepare a (portable) binary archive and send it.
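With MPI_Pack that could look roughly like this (the function name and
the two-part payload are only illustrative):

// Sketch of the MPI_Pack route: pack mixed data into one buffer and send
// it as a single message; the payload here is only illustrative.
#include <mpi.h>
#include <vector>

void pack_and_send(int header, const double* values, int n,
                   int dest, int tag, MPI_Comm comm)
{
    // ask MPI how much space the packed representation needs
    int size_int = 0, size_dbl = 0;
    MPI_Pack_size(1, MPI_INT, comm, &size_int);
    MPI_Pack_size(n, MPI_DOUBLE, comm, &size_dbl);

    std::vector<char> buffer(size_int + size_dbl);
    int position = 0;
    MPI_Pack(&header, 1, MPI_INT,
             &buffer[0], static_cast<int>(buffer.size()), &position, comm);
    MPI_Pack(const_cast<double*>(values), n, MPI_DOUBLE,
             &buffer[0], static_cast<int>(buffer.size()), &position, comm);

    // one send for the whole packed buffer
    MPI_Send(&buffer[0], position, MPI_PACKED, dest, tag, comm);
}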

Matthias

