
Subject: Re: [boost] Parallel Boost Algorithm
From: Nick Edmonds (ngedmond_at_[hidden])
Date: 2010-05-01 21:14:39


On Apr 30, 2010, at 11:49 AM, anil khadka wrote:

> I am pretty new to both boost and parallel boost.
>
> I am using the Brandes betweenness centrality algorithm, but the performance really doesn't meet my expectations.
>
> The Java version, which uses only 1 CPU core, completes it in around 30 hrs.
> The parallel version, which uses 8 CPU cores, took 17 hours!!
>
> I don't know why there is only about a 13-hour gain from using 8 cores instead of 1, unless there is a lot of communication between the cores.
> I am using just 1 machine (not a cluster) right now.
>
> Graph size: 80K nodes, 120K edges.
>
> I am reading the edges from files. I used a distributed compressed_sparse_row graph to represent the graph.
> Right now it seems like every CPU core is reading the files on its own; I assumed each CPU core would pick up its part of the graph automatically (does the distributed algorithm do that?).
>
> Note: I downloaded the parallel version from: http://osl.iu.edu/research/pbgl/software/
>
> Any help/suggestions would be really appreciated.

Which version of the parallel Betweenness Centrality algorithm are you using? There's a non-distributed version that replicates the graph rather than distributing it, and it should be much faster for small graphs like the one you describe. This algorithm should scale much closer to linearly than the distributed version. I've used the non-distributed version to solve instances up to 2^20 vertices and 2^23 edges if I remember right (it might be 2^19/2^22). The non-distributed version parallelizes the outer loop over source vertices, while the distributed version computes one shortest-path tree at a time using a distributed SSSP algorithm.
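
From memory, calling the non-distributed variant looks roughly like the sketch below: the graph is an ordinary sequential CSR graph replicated on every rank, and only the process group spans the ranks. I'm writing the header and function names from memory, so double-check them against the non-distributed betweenness centrality docs before relying on them:

    // Rough sketch, untested -- verify names/overloads against the PBGL docs.
    #include <boost/graph/use_mpi.hpp>
    #include <boost/graph/distributed/mpi_process_group.hpp>
    #include <boost/graph/distributed/betweenness_centrality.hpp>
    #include <boost/graph/compressed_sparse_row_graph.hpp>
    #include <boost/property_map/property_map.hpp>
    #include <boost/mpi/environment.hpp>
    #include <utility>
    #include <vector>

    int main(int argc, char* argv[])
    {
      boost::mpi::environment env(argc, argv);
      boost::graph::distributed::mpi_process_group pg;

      // Plain (non-distributed) CSR graph, replicated on every process.
      typedef boost::compressed_sparse_row_graph<boost::directedS> Graph;
      std::vector<std::pair<std::size_t, std::size_t> > edges;
      // ... read the full edge list on every rank ...
      std::size_t n = 80000;  // total vertex count
      Graph g(boost::edges_are_unsorted, edges.begin(), edges.end(), n);

      // One centrality value per vertex; the algorithm reduces these
      // across the process group when it finishes.
      std::vector<double> centrality(num_vertices(g), 0.0);
      boost::graph::non_distributed_brandes_betweenness_centrality(
          pg, g,
          boost::make_iterator_property_map(centrality.begin(),
                                            get(boost::vertex_index, g)));
    }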

Is your graph weighted or unweighted? This determines whether the algorithm is O(n^2) or O(n^2 log n). The scaling you're seeing is about what I'd expect for a very small graph instance distributed 8 ways; there simply isn't enough work to hide the latency in the MPI layer. You can try adjusting the size of the message buffers in the process group (smaller buffers mean less coalescing, lower latency, and potentially more computation/communication overlap). Check out the --batch-buffer-size parameter to libs/graph_parallel/test/algorithm_performance for an example of how to specify a message buffer size to the process group ctor and then pass that process group to a graph instance.
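
In code that amounts to something along these lines (again from memory; the exact ctor arguments are in mpi_process_group.hpp and the algorithm_performance source, and the numbers below are only illustrative):

    #include <boost/graph/use_mpi.hpp>
    #include <boost/graph/distributed/mpi_process_group.hpp>
    #include <boost/mpi/environment.hpp>
    #include <cstddef>

    int main(int argc, char* argv[])
    {
      boost::mpi::environment env(argc, argv);

      // Smaller batches mean less coalescing and lower latency;
      // tune these for your machine.
      std::size_t num_headers = 16;          // coalesced messages per batch
      std::size_t batch_buffer_size = 4096;  // bytes per batch buffer
      boost::graph::distributed::mpi_process_group pg(num_headers,
                                                      batch_buffer_size);

      // ... construct your distributed CSR graph passing pg as its
      // process-group argument, so all of its communication uses
      // these buffers.
    }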

The distributed-memory implementation in PBGL is sub-optimal for shared memory because accessing data stored on other processes goes through the entire MPI layer rather than simply reducing to a memory reference. The non-distributed BC algorithm only communicates to perform property map reductions, so this is much less of a problem in that case, though you still replicate the graph once per process. In either case the distributed data structures impose some overhead, as does the serialization performed in Boost.MPI (there's a BOOST_MPI_HOMOGENEOUS (IIRC) compile-time flag that disables some of the serialization, though a 'null-serialization' option that doesn't perform any serialization at all would be a very useful performance optimization). I'm in the process of rewriting the PBGL to leverage shared-memory parallelism more effectively, though it's unlikely it will ever be an optimal shared-memory-only library (because of the overhead imposed by the distributed addressing scheme, etc.). There's definitely room for a "Multicore BGL" library, though my lab doesn't have the resources/interest to do it at the moment.

All the graph generators in the PBGL assume that every node reads the entire graph's edge set and filters out the non-local edges. It's possible to write scalable graph generators that don't do this (the indexed DIMACS reader is one example), but this generally requires some sort of preprocessing. Alternatively, you can have each process read a disjoint portion of the edge set and use the in-place CSR ctors to build the graph, as in the sketch below. This last approach is how we fit graphs that take a very large fraction of main memory to represent.
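
Here's a rough sketch of that last approach. It's from memory, so check the in-place ctor tag and the element types it expects against the distributed CSR header; the file name, edge-list format, and round-robin split are just placeholders:

    // Untested sketch: each rank keeps a disjoint slice of the edge list
    // and the distributed CSR graph is built in place from the local arrays.
    #include <boost/graph/use_mpi.hpp>
    #include <boost/graph/distributed/mpi_process_group.hpp>
    #include <boost/graph/distributed/compressed_sparse_row_graph.hpp>
    #include <boost/mpi/environment.hpp>
    #include <fstream>
    #include <vector>

    typedef boost::graph::distributed::mpi_process_group process_group_type;
    typedef boost::compressed_sparse_row_graph<
        boost::directedS, boost::no_property, boost::no_property,
        boost::no_property, boost::distributedS<process_group_type> > Graph;

    int main(int argc, char* argv[])
    {
      boost::mpi::environment env(argc, argv);
      process_group_type pg;

      const std::size_t n = 80000;  // total vertex count, known up front
      const int rank = process_id(pg);
      const int nprocs = num_processes(pg);

      // Each rank keeps only its own slice of the edges.  (For truly huge
      // inputs you'd seek into the file or use one file per rank rather
      // than scanning the whole thing.)
      std::vector<std::size_t> sources, targets;
      std::ifstream in("edges.txt");  // "src dst" per line, placeholder format
      std::size_t src, dst;
      for (std::size_t i = 0; in >> src >> dst; ++i)
        if (i % nprocs == static_cast<std::size_t>(rank)) {
          sources.push_back(src);
          targets.push_back(dst);
        }

      // Build the distributed CSR in place from the local source/target
      // arrays; no process materializes the global edge list.
      Graph g(boost::construct_inplace_from_sources_and_targets,
              sources, targets, n, pg);
    }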

The PBGL svn trunk at http://osl.iu.edu/research/pbgl/software should be a proper superset of the 1.42 Boost release; the 0.7 tarball should be almost exactly equivalent. New features are pushed to Boost in a coarse-grained fashion when they're tested/stable.

Let me know if you have any more questions.

Thanks,
Nick
