
Subject: Re: [Boost-users] distributed breadth first search doesn't work properly?
From: Cedric Laczny (cedric.laczny_at_[hidden])
Date: 2010-12-13 12:10:32


On Monday, 13. December 2010 16:57:23 Mattia Lambertini wrote:
> > I am not familiar with METIS at all, so is there a way to provide the
> > underlying graph in a more intuitive format? An actual image would be
> > nice. Or again, the dot-output for graphviz?
>
> An image of the graph distributed among two processes.
> http://lamberti.web.cs.unibo.it/bfs.png

Thank you. I confused the graph here with the resulting BFS tree.

> If i get your point, process 0 starts a bfs with the n0 as root and process
> 1 starts with n4 as root cause n0 and n4 are the local node 0 of each
> process.

Although it might be that process 1 sees n4 as its root, I don't see why this
should be the case. It seems more likely that the BFS starting at n0 reaches n4,
and process 1 then takes that node as the start node of the part of the search
it runs locally. But you seem to have got the general idea.

> And the output is the minimum distance value of each node
> discovered via n0 and n4 either?
> To what end?

The reason why it takes the minimum distance is simply "by design". As I wrote
in my last mail, a distributed algorithm needs to make sure that a part of the
problem that has already been visited/finished is not arbitrarily changed at a
later point of the execution. Because a BFS starting at n4 reaches one of n4's
neighbors "faster" than a BFS starting at n0 would, n4 gets the first chance to
write a result (although a node per se usually will not perform an action
itself...). When the search from n0 later arrives at a node that n4 has already
finished, the program needs to know how to handle that situation. And the
BFS visitor of PBGL simply states that the value of a node (which is assigned
to the label in the output) is supposed to be the smallest one seen.
If I interpret the question "To what end?" correctly, you are wondering why it
is the smallest value and not the one the node would get when the BFS is
executed with a single process?
If so, this would require checking the process ID, and in the end it would
lead, IMHO, to a distributed BFS that is essentially the same as the serial
version. After all, it would have to wait each time for process 0 to reach that
node, so the running time would become equal to the serial version (or maybe
worse, due to the distribution overhead). The same would then hold for the
result of the distributed execution...
This is what the documentation says:
<quote>
3. Since multiple distance values may be written for each vertex, we must
always choose the best value we can find to update the distance map. The
distributed property map distance needs to resolve distances to the smallest
distance it has seen. For instance, process 0 may find vertex u at level 3 but
process 1 finds it at level 5: the distance must remain at 3. To do this, we
state that the property map's role is as a distance map, which introduces an
appropriate reduction operation:

set_property_map_role(vertex_distance, distance);
</quote>
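
To make that concrete, here is a rough, untested sketch of how I understand the
documented setup. The includes, the edge list (taken from your dot output
below), and the use of record_distances are my own guesses at a minimal
program, not the documented example verbatim, so take the details with a grain
of salt:

#include <boost/graph/use_mpi.hpp>
#include <boost/mpi.hpp>
#include <boost/graph/distributed/mpi_process_group.hpp>
#include <boost/graph/distributed/adjacency_list.hpp>
#include <boost/graph/distributed/breadth_first_search.hpp>
#include <boost/graph/parallel/properties.hpp> // for set_property_map_role, I believe
#include <boost/graph/visitors.hpp>            // record_distances, on_tree_edge
#include <utility>

using namespace boost;
using boost::graph::distributed::mpi_process_group;

// Undirected graph distributed over the MPI processes, with an interior
// vertex_distance property that the BFS writes into.
typedef adjacency_list<vecS,
                       distributedS<mpi_process_group, vecS>,
                       undirectedS,
                       property<vertex_distance_t, std::size_t> > Graph;

int main(int argc, char* argv[])
{
  boost::mpi::environment env(argc, argv);

  // Edge list of the small example graph (as in the dot output below).
  typedef std::pair<int, int> E;
  E edge_array[] = { E(0, 2), E(1, 1), E(1, 3), E(1, 4),
                     E(2, 1), E(2, 3), E(3, 4), E(4, 0), E(4, 1) };
  const int num_nodes = 5;
  Graph g(&edge_array[0],
          &edge_array[0] + sizeof(edge_array) / sizeof(E),
          num_nodes);

  // The distributed distance map; each process stores the vertices it owns.
  property_map<Graph, vertex_distance_t>::type distance =
    get(vertex_distance, g);

  // The line from the quoted documentation: declare that this map acts as a
  // distance map, so conflicting writes are reduced to the smallest value.
  set_property_map_role(vertex_distance, distance);

  graph_traits<Graph>::vertex_descriptor start = vertex(0, g);
  put(distance, start, 0);

  // record_distances writes d(target) = d(source) + 1 on every tree edge;
  // when two processes discover the same vertex at different levels, the
  // reduction above keeps the smaller distance.
  breadth_first_search(g, start,
    visitor(make_bfs_visitor(record_distances(distance, on_tree_edge()))));

  return 0;
}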

>
> This is the output with three processes:
>
> graph g {
> subgraph cluster_0 {
> label="p0";
> n0[label=0];
> n1[label=0];
> }
>
> n0 -- n2;
> n1 -- n1;
> n1 -- n3;
> n1 -- n4;
>
> subgraph cluster_1 {
> label="p1";
> n2[label=0];
> n3[label=1];
> }
>
> n2 -- n1;
> n2 -- n3;
> n3 -- n4;
>
> subgraph cluster_2 {
> label="p2";
> n4[label=0];
> }
>
> n4 -- n0;
> n4 -- n1;
> }
>
> This seems to confirm your explanation (at least what i understood of your
> explanation).

It seems so, yes. However, I am wondering why node n1 (the one with the self-
loop) is assigned a value of 0. If my explanation holds, this could only happen
if a separate search is started at that node. But the grouping of the subgraphs
(which matches the number of processes) does not indicate this. It may well
depend on the internals of the distributed BFS implementation, or on the
general way PBGL/MPI distributes vertices over processes... This is admittedly
vague and perhaps too biased to count as solid speculation, but it might be
that process 0 becomes available once it finds that all adjacent vertices are
already handled by their own processes, and is then "spawned" freshly on node
n1, which is a neighbor of both n4 and n2. The reason why n1 gets its "own"
process might be similar to the reason why n4 gets its "own" process in the
two-process execution.
Recognizing (or validating) a pattern would probably require further runs with
higher process counts and/or running the BFS on other graphs. It might be of
particular interest to see how a linear graph behaves, where each vertex has
only one predecessor and one successor (except for the "source" and the
"sink"); a possible edge list for such a test is sketched below.

Unfortunately, I can neither confirm that this is the correct behavior nor
point to something in particular that is going wrong. That probably doesn't
help you much; I apologize for that.

Maybe you could explain why you are particularly interested in a distributed
BFS? There are of course obvious reasons like "It exists and I have the
hardware here, so I want to use it". But if your reason is more specific, an
alternative or a work-around might be possible.

I do understand that you are supposed to solve a particular problem and not to
spend your time correcting or validating someone else's code. The sad truth is
that working on an actual problem and verifying that the foundation (even for
off-the-shelf products) works as expected (or here, rather, as designed) go
more or less hand in hand. I have had to experience that several times already.

So unless someone else drops in to confirm (or refute) the behavior of the
PBGL BFS, it would be interesting (at least for me ;) to see how it behaves in
different scenarios (like more processes or linear graphs) in order to shed
some more light on this.

Best,

Cedric

