Subject: Re: [boost] [Boost-users] What's so cool about Boost.MPI?
From: Matthias Troyer (troyer_at_[hidden])
Date: 2010-11-12 01:45:06
On 12 Nov 2010, at 05:26, Sid Sacek wrote:
>> I also disagree with the statement that communication is faster than
>> computation. Even if you have 10 Gb/second networks into a compute node,
>> that corresponds only to about 150 M double precision floating point numbers per second.
>> Let's connect that to a node with a *single* quad core Nehalem CPU that
>> operates at actually measured sustained speeds of 74 Gflop/s, and you see that
>> the network is 500 times slower. Using 4 such CPUs on a quad-socket node brings
>> the performance ratio to 2000! Even 10 times faster networks will only take
>> this down to a factor of 200.
>> Thus, in contrast to your statements, networks are *not* an order or two of magnitude
>> faster than computers but two or three orders of magnitude slower than the compute
>> nodes. This will only get worse by an additional two orders of magnitude once we
>> add GPUs or other future accelerator chips to the nodes.
> Wow! You picked (cherry-picked) a very particular data type, and then performed a simple division between the FPU speed and the incoming data rate.
Not at all! I picked a typical high performance computing application, the kind for which MPI is designed. Such performance is achieved in real-world applications, and the main bottleneck is usually the network.
> There are so many things that occur in the CPU before you can process network data. Like NIC interrupts to the drivers, driver interrupt processing, drivers signaling the running processes, task swaps, page faults, paging, cache flushes, cache updates, data transfers between buffers two to five times before it is processed, endian conversions, programs switching on key data bytes to call the proper procedures to process the data, the processed data then being used to trigger new actions, etc...
Actually, this is where MPI comes into play, by getting rid of most of that overhead. What you describe above is exactly the reason why high performance computers use special network hardware, and why high performance MPI implementations are not built on top of TCP/IP. Using MPI we bypass most of what you list above, and we typically know what kind of data to expect. Even so, network latency and bandwidth are a worsening bottleneck for the scaling of parallel programs.
> A much better algorithm to use for calculating performance is to determine how many assembly instructions you anticipate it will take to process a single byte of data. Data comes in infinite forms. Before the FPU gets a crack at the data, it has to pass through the CPU.
> Think about it... the data coming in from the network isn't being fed straight into your FPU hardware and the results being tossed away.
> My experience in "network data" processing is very different from yours.
Indeed. I am talking about high performance computing, which is apparently very different from your kind of applications. In HPC we are often able to saturate the floating point units, and memory and network performance are the big problems. If you wish, I can send you application codes that are totally limited by bandwidth, and where the CPU is idle 99% of the time.
Keep in mind that MPI (and Boost.MPI) is specifically designed for high performance computing applications, and we should look at such applications when talking about MPI. And there my estimates above are right, even if they are an upper bound for what we see. Even if we do 10 or a hundred times more operations on the data, the CPU is still about 5 times faster than the network. In addition, the processing power of the compute nodes grows much faster than network speeds, so this issue will only get worse.
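For anyone who wants to check the arithmetic, here is a rough sketch of the estimate in C++. The function names are hypothetical and made up for this illustration. The inputs are the figures quoted above: a 10 Gb/s link, 8-byte doubles, and 74 Gflop/s sustained per CPU. It ignores latency and protocol overhead, so it is an upper bound on what the network delivers.

```cpp
#include <cassert>

// Hypothetical helper: how many values per second a link can deliver,
// given its raw bit rate. Ignores latency and protocol overhead.
double values_per_second(double link_bits_per_second, double bytes_per_value) {
    assert(bytes_per_value > 0.0);
    return link_bits_per_second / 8.0 / bytes_per_value;
}

// Hypothetical helper: floating point operations the node can perform
// in the time it takes one value to arrive over the network.
double flops_per_received_value(double node_flops,
                                double link_bits_per_second,
                                double bytes_per_value) {
    return node_flops /
           values_per_second(link_bits_per_second, bytes_per_value);
}
```

With these inputs, `values_per_second(10e9, 8.0)` gives about 156 M doubles per second, and `flops_per_received_value(74e9, 10e9, 8.0)` gives a ratio of roughly 470, which I rounded to 500 above. Four such CPUs give roughly 1900, and a 10 times faster network brings that down to roughly 190.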