
Boost : 
Subject: Re: [boost] Parallel Boost Algorithm
From: anil khadka (anil_khadka_at_[hidden])
Date: 20100504 15:11:39
Thanks for the prompt response.
I haven't tried the nondistributed (name is confusing though) version. I am using regular version with unweighted graph (I am not passing any weight parameter, as default is unweighted).
I used "0.7.0 beta" (parallelbgl0.7.0.tar.gz).
I think i should try the nondistributed version as 80K quite smaller than 2^20 nodes. (BTW, I also have a graph with more than millions nodes too, but that comes later.)
You said that we can have each process read a disjoint portion of the edge set and use the inplace CSR ctors to build the graph, Are there any examples in the source code for this??
One more thing, I also want to calculate the Closeness Centrality, and i do not want to reprocessed the whole graph, as most of the calculation(like shortest path) is same as betweenness. Can't we calculate Closeness centrality on the same pass?
I'll also tune the parameters(like messagebuffer), and post the result here.
Metanil.
________________________________
From: Nick Edmonds <ngedmond_at_[hidden]>
To: boost_at_[hidden]
Sent: Sat, May 1, 2010 9:14:39 PM
Subject: Re: [boost] Parallel Boost Algorithm
On Apr 30, 2010, at 11:49 AM, anil khadka wrote:
> I am pretty new to both boost and parallel boost.
>
> I am using Brandes Centrality algorithm, but the performance really doesn't meet my expectation.
>
> If I used Java, which is only using 1 CPU core would complete it in around 30 hrs.
> The parallel version which use 8 CPU cores tooks 17 hours!!
>
> I don't know why there is just around 13 hours gain if i use 8 cores against 1, unless there are lots and lots of communication between each cores.
> I am using just 1 machine (not a cluster), right now.
>
> Graph size is Node 80K, edges 120K
>
> I am reading edges from files. I used distributed compressed_sparse_row to represent graph.
> Right now, it seems like all the cpu cores are reading files on its own, I am assuming each CPU core will takes its graph part automatically (distributed algorithm do it ?).
>
> Note: I downloaded the parallel version from: http://osl.iu.edu/research/pbgl/software/
>
> Any help/suggestions would be really appreciated.
Which version of the parallel Betweenness Centrality algorithm are you using? There's a nondistributed version that replicates the graph as opposed to distributing it that should be much faster for small graphs like the one you describe. This algorithm should scale much closer to linearly than the distributed version. I've used the nondistributed version to solve instances up to 2^20 vertices and 2^23 edges if I remember right (it might be 2^19/2^22). The nondistributed version parallelizes the outer loop over source vertices, while the distributed version computes one shortestpath tree at a time using a distributed SSSP algorithm.
Is your graph weighted or unweighted? This determines whether the algorithm is O(n^2) or O(n^2 log n). The scaling you're seeing is about what I'd expect for a very small graph instance distributed 8ways, there simply isn't enough work to hide the latency in the MPI layer. You can try adjusting the size of message buffers in the process group (smaller buffers mean less coalescing, lower latency, and potentially more computation/communication overlap). Check out the batchbuffersize parameter to libs/graph_parallel/test/algorithm_performance for an example of how to specify a message buffer size to the process group ctor and then pass the process group to a graph instance.
The distributedmemory implementation in PBGL is suboptimal for shared memory because accessing data stored on other processes goes through the entire MPI layer as opposed to simply reducing to a memory reference. The nondistributed BC algorithm only performs communication to perform property map reductions so this is much less of a problem in that case though you still replicate the graph once per process. In either case the distributed data structures impose some overhead, as does the serialization performed in Boost.MPI (there's a BOOST_MPI_HOMOGENOUS (IIRC) compile time flag that disables some serialization though a 'nullserialization' option that doesn't perform any serialization would be a very useful performance optimization). I'm in the process of rewriting the PBGL to leverage sharedmemory parallelism more effectively, though it's unlikely it will every be an optimal sharedmemoryonly library (because of the overhead imposed by the
distributed addressing sche
me etc.). There's definitely room for a "Multicore BGL" library, though my lab doesn't have the resources/interest to do it at the moment.
All the graph generators in the PBGL assume that every node reads the entire graph's edge set and filters out nonlocal edges. It's possible to write scalable graph generators which don't do this (the indexed dimacs reader is one example), but this generally requires some sort of preprocessing. Alternately you can have each process read a disjoint portion of the edge set and use the inplace CSR ctors to build the graph. This last approach is how we fit graphs that take a very large fraction of main memory to represent.
The PBGL svn trunk at http://osl.iu.edu/research/pbgl/software should be a proper superset of the 1.42 Boost release, the 0.7 tarball should be almost exactly equivalent. New features are pushed to Boost in a coarsegrained fashion when they're tested/stable.
Let me know if you have any more questions.
Thanks,
Nick
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk