From: Douglas Gregor (doug.gregor_at_[hidden])
Date: 2006-09-16 10:50:31


On Sep 16, 2006, at 2:20 AM, K. Noel Belcourt wrote:
> I was able to review the implementations of the broadcast, gather,
> scatter, and reduce functions which all call through to the
> corresponding MPI_ function. This is perfectly reasonable. But
> these, and other, functions can be implemented much more efficiently
> using sends and recvs. These less efficient implementations may
> adversely impact adoption of Boost.MPI by the larger high performance
> computing community. I would like the authors to consider these more
> efficient algorithms at some point in the future.

Performance is extremely important to us, so I want to make sure I
understand exactly what you mean.

One of the biggest assumptions we make, particularly with
collectives, is that using the most specialized MPI call gives the
best performance. So if the user sums up integers with a reduce()
call, we should call MPI_Reduce(..., MPI_INT, MPI_SUM, ...) to get
the best performance, because it has probably been optimized by the
MPI vendor, both in general (i.e., a better algorithm than ours) and
for their specific hardware. Of course, if the underlying MPI has a
poorly-optimized implementation of MPI_Reduce, it is conceivable that
Boost.MPI's simple tree-based implementation could perform better. I
haven't actually run into this problem yet, but it clearly can
happen: I've peeked at one or two MPI implementations and have been
appalled at how naively some of the collectives are implemented. I
think this is the point you're making: it might be better not to
specialize down to, e.g., the MPI_Reduce call, depending on the
underlying MPI implementation.
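
To make the specialization assumption concrete, here is a rough sketch (not Boost.MPI's actual code; the helper name sum_to_root and the raw MPI calls are purely illustrative) of what the fully specialized path amounts to when the user reduces ints with a plus operation:

#include <mpi.h>
#include <vector>

// Illustrative only: the fully specialized path is a single call into the
// vendor's MPI_Reduce with the built-in MPI_INT / MPI_SUM handles, so any
// vendor tuning (algorithm choice, hardware support) is used automatically.
void sum_to_root(MPI_Comm comm, const std::vector<int>& in,
                 std::vector<int>& out, int root)
{
  out.resize(in.size());
  // const_cast only because pre-MPI-3 headers declare sendbuf as void*.
  MPI_Reduce(const_cast<int*>(in.data()), out.data(),
             static_cast<int>(in.size()), MPI_INT, MPI_SUM, root, comm);
}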

There is at least one easy way to address this issue. We could
introduce a set of global, compile-time flags that state whether the
underlying implementation of a given collective is better than ours.
These flags would vary depending on the underlying MPI. For instance,
maybe Open MPI has a fast broadcast implementation, so we would have

        typedef mpl::true_ has_fast_bcast;

whereas LAM/MPI might not have a fast broadcast:

        typedef mpl::false_ has_fast_bcast;

These flags would be queried in the algorithm dispatch logic:

template<typename T>
void broadcast(const communicator& comm, T& value, int root = 0)
{
   detail::broadcast_impl(comm, value, root,
                          mpl::and_<is_mpi_datatype<T>, has_fast_bcast>());
}
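
For completeness, here is a sketch of what the two broadcast_impl overloads could look like. This is not the actual Boost.MPI internals: the signatures, the tag value 0, and the binomial-tree fallback are assumptions. The mpl::and_ expression above binds to these overloads because MPL's logical metafunctions are integral constants convertible to mpl::true_/mpl::false_.

#include <boost/mpi/communicator.hpp>
#include <boost/mpi/datatype.hpp>
#include <boost/mpl/bool.hpp>

namespace boost { namespace mpi { namespace detail {

// Fast path: T maps to an MPI datatype and the vendor's broadcast is
// trusted, so forward directly to MPI_Bcast.
template<typename T>
void broadcast_impl(const communicator& comm, T& value, int root,
                    mpl::true_ /* use the vendor MPI_Bcast */)
{
  MPI_Bcast(&value, 1, get_mpi_datatype(value), root, comm);
}

// Fallback: T needs serialization or the vendor broadcast is assumed slow,
// so run a hand-rolled binomial tree of point-to-point messages.
template<typename T>
void broadcast_impl(const communicator& comm, T& value, int root,
                    mpl::false_ /* hand-rolled tree */)
{
  const int size = comm.size();
  const int relative = (comm.rank() - root + size) % size;

  // Receive the value from the parent in the tree (the root skips this).
  int mask = 1;
  while (mask < size) {
    if (relative & mask) {
      comm.recv((relative - mask + root) % size, 0, value);
      break;
    }
    mask <<= 1;
  }

  // Forward the value to each child in the tree.
  mask >>= 1;
  while (mask > 0) {
    if (relative + mask < size)
      comm.send((relative + mask + root) % size, 0, value);
    mask >>= 1;
  }
}

} } } // namespace boost::mpi::detail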

The only tedious part of implementing this is determining which
collectives are well-optimized in all of the common MPI
implementations, although we could certainly assume the best and
tweak the configuration as our understanding evolves.
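
Concretely, the flags could live in a small per-implementation configuration header along these lines. This is only a sketch: the vendor-identifying macros (OPEN_MPI, LAM_MPI) are what those implementations' <mpi.h> headers are generally believed to define, and the particular true/false choices are placeholders rather than measured results.

#include <mpi.h>
#include <boost/mpl/bool.hpp>

namespace boost { namespace mpi {

#if defined(OPEN_MPI)
  // Assume Open MPI's broadcast is well tuned: call MPI_Bcast directly.
  typedef mpl::true_  has_fast_bcast;
#elif defined(LAM_MPI)
  // Assume LAM/MPI's broadcast is naive: prefer the hand-rolled tree.
  typedef mpl::false_ has_fast_bcast;
#else
  // Unknown MPI: assume the best and tweak as experience accumulates.
  typedef mpl::true_  has_fast_bcast;
#endif

} } // namespace boost::mpi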

        Doug
