Subject: Re: [Boost-users] [PBGL] Results of distributed BFS on multiple processors vs. on a single processor
From: Nick Edmonds (ngedmond_at_[hidden])
Date: 2011-05-03 19:08:06


On Apr 23, 2011, at 5:09 AM, Cedric Laczny wrote:

> On Friday, 22. April 2011 23:05:04 Nick Edmonds wrote:
>> On Apr 22, 2011, at 11:01 AM, Cedric Laczny wrote:
>>> On Friday, 22. April 2011 16:31:20 Steven Watanabe wrote:
>>>> AMDG
>>>>
>>>> On 04/22/2011 05:59 AM, Cedric Laczny wrote:
>>>>> it recently came to my mind again that the distributed BFS in the PBGL
>>>>> seems to produce different results when applied to multiple processors
>>>>> than when applied to a single processor (see
>>>>> http://lists.boost.org/boost-users/2010/12/64783.php).
>>>>> If this is true, I would really like to know the reasoning behind this,
>>>>> and it should of course be noted in the documentation to help prevent
>>>>> surprises like those in the discussion above.
>>>>> Unfortunately, no definite answer was given at the time, and this
>>>>> question has puzzled me again now...
>>>>> Any ideas on that?
>>>>
>>>> Why would you expect the results to be exactly the
>>>> same? Parallelism usually introduces a certain
>>>> amount of non-determinism.
>>>
>>> I agree with you on that, generally. And it is good to note this
>>> explicitly again.
>>> However, maybe my expectation here is unusual, but especially for
>>> search algorithms I would expect parallel implementations to return the
>>> same results as the sequential ones, since otherwise the question arises
>>> of how to correctly interpret the parallel results.
>>> I mean, if you get different distance results depending on the number of
>>> processors, what then is the use of distributed BFS, e.g. if you want to
>>> know the BFS distance from the source to a specific vertex x?
>>
>> The current PBGL (and any other *correct* parallel BFS implementation) will
>> explore one of a family of valid breadth-first vertex orderings. For a
>> variety of reasons, however, the sequential visitor semantics do not map
>> well to parallel execution. Uncertainty about the semantics of visitors in
>> a parallel context, combined with the fact that the ordering of vertex
>> discovery and edge traversal within levels is necessarily
>> non-deterministic, often confuses users about the results returned by
>> their BFS visitors. A correctly written visitor will return the same
>> distance labels as the sequential implementation, but may not explore the
>> vertices/edges in the same order.
>
> Ok, that's interesting to know. IMHO, someone who simply wants to speed up
> the computation of a BFS by using a parallelised version generally does not
> care about the order of exploration but is interested only in the result,
> such as the distance labels, and in that scenario those should match the
> sequential case. Good to know that this is in fact the case.
> However, this brings me to the question of why the parallel BFS Boost
> example then returns different distance labels depending on the number of
> processors involved. Or is this just a misinterpretation of the output?

Sorry it took me a while to get to this... The BFS example appears to have at least one major issue: the distance labels it returns appear to be completely non-deterministic, because they are assigned on tree_edge() events, which may occur simultaneously on multiple ranks and may continue to occur after a vertex has already been discovered by its owning rank. The result of these assignments is determined by the reduction operation on the distance property map. The default reduction operation allows an arbitrary write to succeed, which results in incorrect distance labels when more than one process is involved. I have no idea how long this example has been this way, but judging by the fact that it still uses internal properties, I'd hazard a guess it has been a very long time. I'll take the blame for not catching that in the original release.

The quick fix for this example is to set the reduction operation on the distance map to choose the minimum distance label. I'll make that change and check it into the trunk.
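Something along these lines should do it (an untested sketch, assuming the distance map stores std::size_t values and exposes set_reduce() as described in the distributed property map documentation; the name min_distance_reducer is purely illustrative):

#include <algorithm>
#include <cstddef>
#include <limits>

// Reduction operation for the distance map, modeled on the Reduce concept
// from the distributed property map documentation.
struct min_distance_reducer
{
  // Required by the Reduce concept.
  static const bool non_default_resolver = true;

  // Default value for a vertex whose distance has not been written yet.
  template <typename Key>
  std::size_t operator()(const Key&) const
  { return (std::numeric_limits<std::size_t>::max)(); }

  // Resolve conflicting writes by keeping the smaller distance label.
  template <typename Key>
  std::size_t operator()(const Key&, std::size_t local,
                         std::size_t remote) const
  { return (std::min)(local, remote); }
};

// Before running the BFS, attach the reducer to the distributed distance
// map (called 'distance' here):
//   distance.set_reduce(min_distance_reducer());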

That approach should work, but there's also a more subtle issue. Its correctness relies on channel ordering semantics, i.e. that remote operations are performed on the receiving rank in the same order in which they were generated. In this case that means that, on the target rank, the put() into the distance map has to happen before the corresponding vertex is removed from the queue; otherwise, when its out-edges are explored, its distance label will be incorrect, and so will the labels assigned to the targets of those out-edges.

These data structures are independent, as are the sets of remote operations performed on them (i.e. the put()s to the property maps and the queue operations). The existing communication layer happens to enforce sequential ordering across all messages, but this is really just an implementation detail. Furthermore, it works only as long as there is one thread per process; trying to parallelize these operations within a process would require breaking the underlying channel ordering semantics and would result in incorrect behavior.
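To make the ordering dependency concrete, the visitor involved is roughly of this shape (an illustrative sketch only, not the example's actual code; DistanceMap is assumed to be a distributed property map):

#include <boost/graph/breadth_first_search.hpp>

// Visitor that assigns distance labels on tree_edge(), as the example does.
template <typename DistanceMap>
struct distance_recorder : boost::default_bfs_visitor
{
  explicit distance_recorder(DistanceMap d) : distance(d) {}

  template <typename Edge, typename Graph>
  void tree_edge(Edge e, const Graph& g) const
  {
    // This put() may be buffered and applied later on the rank that owns
    // target(e, g). The result is correct only if it is applied before that
    // vertex is popped from the distributed queue and its out-edges are
    // explored; that is exactly the channel ordering assumption described
    // above.
    put(distance, target(e, g), get(distance, source(e, g)) + 1);
  }

  DistanceMap distance;
};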

This is a fundamental issue with the BSP-style programming model of the original PBGL. Control flow (via the queue in this case) and data dependencies (via property maps) are disjoint, and rely on fragile ordering semantics to work properly.

To summarize, mea culpa on not catching this a few versions back, and I'll check in a more correct version.

Cheers,
Nick

