From: K. Noel Belcourt (kbelco_at_[hidden])
Date: 2006-09-18 12:19:17
On Sep 16, 2006, at 8:50 AM, Douglas Gregor wrote:
> On Sep 16, 2006, at 2:20 AM, K. Noel Belcourt wrote:
>> I was able to review the implementations of the broadcast, gather,
>> scatter, and reduce functions which all call through to the
>> corresponding MPI_ function. This is perfectly reasonable. But
>> these, and other, functions can be implemented much more efficiently
>> using sends and recvs. These less efficient implementations may
>> adversely impact adoption of Boost.MPI by the larger high performance
>> computing community. I would like the authors to consider these more
>> efficient algorithms at some point in the future.
> Performance is extremely important to us, so I want to make sure I
> understand exactly what you mean.
> One of the biggest assumptions we make, particularly with
> collectives, is that using the most specialized MPI call gives the
> best performance. So if the user sums up integers with a reduce()
> call, we should call MPI_Reduce(..., MPI_INT, MPI_SUM, ...) to get
> the best performance, because it has probably been optimized by the
> MPI vendor, both in general (i.e., a better algorithm than ours) and
> for their specific hardware. Of course, if the underlying MPI has a
> poorly-optimized implementation of MPI_Reduce, it is conceivable that
> Boost.MPI's simple tree-based implementation could perform better. I
> haven't actually run into this problem yet, but it clearly can
> happen: I've peeked at one or two MPI implementations and have been
> appalled at how naively some of the collectives are implemented. I
> think this is the point you're making: it might be better not to
> specialize down to, e.g., the MPI_Reduce call, depending on the
> underlying MPI implementation.
> There is at least one easy way to address this issue. We could
> introduce a set of global, compile-time flags that state whether the
> underlying implementation of a given collective is better than ours.
> These flags would vary depending on the underlying MPI. For instance,
> maybe Open MPI has a fast broadcast implementation, so we would have
>   typedef mpl::true_ has_fast_bcast;
> whereas LAM/MPI might not have a fast broadcast:
>   typedef mpl::false_ has_fast_bcast;
> These flags would be queried in the algorithm dispatch logic:
>   template<typename T>
>   void broadcast(const communicator& comm, T& value, int root = 0)
>   {
>     detail::broadcast_impl(comm, value, root,
>                            mpl::and_<is_mpi_datatype<T>, has_fast_bcast>());
>   }
> The only tedious part of implementing this is determining which
> collectives are well-optimized in all of the common MPI
> implementations, although we could certainly assume the best and
> tweak the configuration as our understanding evolves.
I think this is the best option: assume the native MPI
implementations are efficient, then flip these flags as we find
evidence to the contrary. I like this solution: it is very clean,
has no runtime overhead, and is easy to configure.
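To make the "no runtime overhead" point concrete, here is a minimal, self-contained sketch of the tag-dispatch idea. It is not Boost.MPI's code: it assumes std::true_type/std::false_type in place of the mpl types, uses a hypothetical is_mpi_datatype_sketch trait as a stand-in for is_mpi_datatype, drops the communicator argument, and has each implementation return a label instead of doing any communication, so that the selected path is visible.

```cpp
#include <string>
#include <type_traits>

// Hypothetical per-MPI configuration flag. Flip to std::false_type if the
// native broadcast of a given MPI implementation turns out to be slow.
typedef std::true_type has_fast_bcast;

// Stand-in for is_mpi_datatype<T> (an assumption for this sketch): only
// arithmetic types are treated as having a native MPI datatype.
template<typename T>
struct is_mpi_datatype_sketch : std::is_arithmetic<T> {};

namespace detail {
  // Selected when T has a native datatype AND the native collective is fast.
  // A real implementation would call MPI_Bcast here.
  template<typename T>
  std::string broadcast_impl(T&, int, std::true_type) { return "native"; }

  // Fallback: a real implementation would use its own send/recv tree here.
  template<typename T>
  std::string broadcast_impl(T&, int, std::false_type) { return "tree"; }
}

template<typename T>
std::string broadcast(T& value, int root = 0)
{
  // The tag type is computed at compile time, so overload resolution picks
  // the implementation with no runtime branch.
  typedef std::integral_constant<
      bool,
      is_mpi_datatype_sketch<T>::value && has_fast_bcast::value> use_native;
  return detail::broadcast_impl(value, root, use_native());
}
```

With the flag set as above, broadcast() on an int selects the "native" overload and broadcast() on a std::string selects the "tree" overload. Changing the typedef to std::false_type sends every type down the "tree" path without touching any call site.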
I look forward to using your library.
-- Noel Belcourt