Subject: Re: [boost] Parallel Boost Algorithm
From: Nick Edmonds (ngedmond_at_[hidden])
Date: 2010-05-10 18:26:47


On May 4, 2010, at 3:11 PM, anil khadka wrote:

> Thanks for the prompt response.
>
> I haven't tried the non-distributed version (the name is confusing though). I am using the regular version with an unweighted graph (I am not passing any weight parameter, as the default is unweighted).

Ok, the unweighted version is significantly faster than the weighted version. Yep, the algorithm got written/named during a paper deadline... at the last minute... I was not at my most creative. I'm not really sure what a more-accurate name would be though. I'm open to ideas for better names.

>
> I used "0.7.0 beta" (parallel-bgl-0.7.0.tar.gz).
> I think I should try the non-distributed version, as 80K is quite a bit smaller than 2^20 nodes. (BTW, I also have a graph with more than a million nodes, but that comes later.)
>

I don't think there have been any changes since the 0.7 tarball, though there is a bug fix in the distributed CSR graph for compatibility with changes in Boost 1.42/1.43. That bug fix is in both the PBGL trunk and Boost trunk.

> You said that we can have each process read a disjoint portion of the edge set and use the in-place CSR ctors to build the graph. Are there any examples of this in the source code?
>

If you check libs/graph_parallel/test/distributed_csr_ctor_test.cpp there's a simple example of each ctor.
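
For reference, the rough shape is something like this (an untested sketch from memory -- the exact constructor tags, whether the vertex indices are global or local, and the trailing arguments should all be checked against that test; read_local_edges() below is just a stand-in for however you read your slice of the edge file):

  #include <boost/graph/use_mpi.hpp>
  #include <boost/mpi.hpp>
  #include <boost/graph/distributed/mpi_process_group.hpp>
  #include <boost/graph/distributed/compressed_sparse_row_graph.hpp>
  #include <cstddef>
  #include <utility>
  #include <vector>

  typedef boost::graph::distributed::mpi_process_group process_group_type;
  typedef boost::compressed_sparse_row_graph<
      boost::directedS, boost::no_property, boost::no_property,
      boost::no_property, boost::distributedS<process_group_type> > Graph;

  // Stand-in for reading this process's disjoint slice of the edge file;
  // replace with your real I/O code.
  std::vector<std::pair<std::size_t, std::size_t> > read_local_edges(int rank)
  {
    std::vector<std::pair<std::size_t, std::size_t> > edges;
    edges.push_back(std::make_pair(std::size_t(2 * rank),
                                   std::size_t(2 * rank + 1)));
    return edges;
  }

  int main(int argc, char* argv[])
  {
    boost::mpi::environment env(argc, argv);
    process_group_type pg;

    // Each process passes only the edges it read itself (with global
    // vertex indices); the graph ctor takes care of the rest.
    std::vector<std::pair<std::size_t, std::size_t> > local_edges =
      read_local_edges(process_id(pg));

    std::size_t n = 80000;  // total number of vertices in the whole graph
    Graph g(boost::edges_are_unsorted,
            local_edges.begin(), local_edges.end(), n, pg);
    return 0;
  }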

> One more thing: I also want to calculate Closeness Centrality, and I do not want to re-process the whole graph, as most of the calculation (like the shortest paths) is the same as for betweenness. Can't we calculate Closeness Centrality in the same pass?

A cursory glance at the definition of Closeness Centrality indicates to me that this should be simple to compute at the same time as Betweenness Centrality. You would just have to modify the algorithm to propagate the distance to each reachable vertex back to the source vertex on each SSSP iteration. As long as you accumulate the sum of the distances on the process that owns the source vertex, the naive algorithm should be scalable.
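
To spell out the arithmetic (this is just the textbook definition, not PBGL code -- conventions for unreachable vertices vary, so adjust the normalization to whichever definition you need):

  #include <cstddef>
  #include <limits>
  #include <vector>

  // Distances from one SSSP sweep rooted at s, with unreachable vertices
  // left at std::numeric_limits<double>::max().  Sum d(s,v) over the
  // reachable v != s and normalize by how many were reached; for a
  // connected graph this is the usual (n-1) / sum-of-distances form.
  double closeness_from_sssp(const std::vector<double>& dist, std::size_t s)
  {
    const double inf = std::numeric_limits<double>::max();
    double sum = 0.0;
    std::size_t reached = 0;
    for (std::size_t v = 0; v < dist.size(); ++v) {
      if (v == s || dist[v] == inf) continue;  // skip self and unreachable
      sum += dist[v];
      ++reached;
    }
    return sum > 0.0 ? static_cast<double>(reached) / sum : 0.0;
  }

In the distributed setting each process would compute its partial sum and count over the vertices it owns, then reduce them to the process that owns s, which is exactly the accumulation described above.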

>
> I'll also tune the parameters (like the message buffer size) and post the results here.

I hope your tuning is going well. Tuning the message buffer size should have much less effect on the non_distributed algorithm than on the one that uses a distributed graph.
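
From memory, the process group ctor takes a header count and a buffer size, so the setup looks roughly like the sketch below -- double-check the exact arguments against the --batch-buffer-size handling in libs/graph_parallel/test/algorithm_performance.cpp. The numbers are placeholders, not recommendations:

  #include <boost/graph/use_mpi.hpp>
  #include <boost/mpi.hpp>
  #include <boost/graph/distributed/mpi_process_group.hpp>
  #include <cstddef>

  typedef boost::graph::distributed::mpi_process_group process_group_type;

  int main(int argc, char* argv[])
  {
    boost::mpi::environment env(argc, argv);

    // Smaller buffers mean less coalescing and lower latency; larger
    // buffers mean fewer, bigger messages.  These values are placeholders.
    std::size_t num_headers = 64;
    std::size_t buffer_size = 16 * 1024;
    process_group_type pg(num_headers, buffer_size);

    // Hand pg to the distributed graph's ctor (as in the CSR sketch above)
    // so the graph and the algorithms run on the tuned process group.
    return 0;
  }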

I've observed that the performance of the distributed algorithm is inversely proportional to the number of equal-length paths. Things scale quite nicely when there are relatively few equal-length paths, but as the number of equal-length paths increases, scaling breaks down at progressively smaller processor counts. The performance of the non-distributed version pretty much always scales linearly, as you would expect.

Cheers,
Nick

> Metanil.
>
>
>
>
>
> ________________________________
> From: Nick Edmonds <ngedmond_at_[hidden]>
> To: boost_at_[hidden]
> Sent: Sat, May 1, 2010 9:14:39 PM
> Subject: Re: [boost] Parallel Boost Algorithm
>
> On Apr 30, 2010, at 11:49 AM, anil khadka wrote:
>
>> I am pretty new to both boost and parallel boost.
>>
>> I am using the Brandes betweenness centrality algorithm, but the performance really doesn't meet my expectations.
>>
>> If I used Java, which uses only 1 CPU core, it would complete in around 30 hours.
>> The parallel version, which uses 8 CPU cores, took 17 hours!!
>>
>> I don't know why there is only about a 13-hour gain using 8 cores instead of 1, unless there is a lot of communication between the cores.
>> I am using just 1 machine (not a cluster) right now.
>>
>> Graph size: 80K nodes, 120K edges.
>>
>> I am reading edges from files. I used the distributed compressed_sparse_row graph to represent the graph.
>> Right now, it seems like all the CPU cores are reading the files on their own; I am assuming each CPU core will take its part of the graph automatically (does the distributed algorithm do that?).
>>
>> Note: I downloaded the parallel version from: http://osl.iu.edu/research/pbgl/software/
>>
>> Any help/suggestions would be really appreciated.
>
> Which version of the parallel Betweenness Centrality algorithm are you using? There's a non-distributed version that replicates the graph rather than distributing it, which should be much faster for small graphs like the one you describe. This algorithm should scale much closer to linearly than the distributed version. I've used the non-distributed version to solve instances up to 2^20 vertices and 2^23 edges if I remember right (it might be 2^19/2^22). The non-distributed version parallelizes the outer loop over source vertices, while the distributed version computes one shortest-path tree at a time using a distributed SSSP algorithm.
>
> Is your graph weighted or unweighted? This determines whether the algorithm is O(n^2) or O(n^2 log n). The scaling you're seeing is about what I'd expect for a very small graph instance distributed 8 ways; there simply isn't enough work to hide the latency in the MPI layer. You can try adjusting the size of message buffers in the process group (smaller buffers mean less coalescing, lower latency, and potentially more computation/communication overlap). Check out the --batch-buffer-size parameter to libs/graph_parallel/test/algorithm_performance for an example of how to specify a message buffer size to the process group ctor and then pass the process group to a graph instance.
>
> The distributed-memory implementation in PBGL is sub-optimal for shared memory because accessing data stored on other processes goes through the entire MPI layer as opposed to simply reducing to a memory reference. The non-distributed BC algorithm only performs communication to perform property map reductions, so this is much less of a problem in that case, though you still replicate the graph once per process. In either case the distributed data structures impose some overhead, as does the serialization performed in Boost.MPI (there's a BOOST_MPI_HOMOGENEOUS (IIRC) compile-time flag that disables some serialization, though a 'null-serialization' option that doesn't perform any serialization would be a very useful performance optimization). I'm in the process of rewriting the PBGL to leverage shared-memory parallelism more effectively, though it's unlikely it will ever be an optimal shared-memory-only library (because of the overhead imposed by the distributed addressing scheme, etc.). There's definitely room for a "Multicore BGL" library, though my lab doesn't have the resources/interest to do it at the moment.
>
> All the graph generators in the PBGL assume that every node reads the entire graph's edge set and filters out non-local edges. It's possible to write scalable graph generators which don't do this (the indexed dimacs reader is one example), but this generally requires some sort of preprocessing. Alternately you can have each process read a disjoint portion of the edge set and use the in-place CSR ctors to build the graph. This last approach is how we fit graphs that take a very large fraction of main memory to represent.
>
> The PBGL svn trunk at http://osl.iu.edu/research/pbgl/software should be a proper superset of the 1.42 Boost release, the 0.7 tarball should be almost exactly equivalent. New features are pushed to Boost in a coarse-grained fashion when they're tested/stable.
>
> Let me know if you have any more questions.
>
> Thanks,
> Nick
> _______________________________________________
> Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

