Subject: Re: [boost] Flow-based programming library for Boost?
From: Julian Gonggrijp (j.gonggrijp_at_[hidden])
Date: 2012-12-16 18:22:25
Thanks for clarifying, Dave. I'm glad to know that we're mostly on the same page.
Dave Abrahams wrote:
> on Sat Dec 15 2012, Julian Gonggrijp wrote:
>> [...] In part I'm
>> puzzled by what seems to me to be your bolder claim: the possibility
>> of statically wired asynchronous concurrent message passing.
> I don't see why it shouldn't be possible.
In retrospect I do see why you didn't see it. Note the word
"asynchronous". I'll get back to this point below.
>> [...] I believe we're not disagreeing at all. You seem to be implying
>> that for very tightly interdependent computations static, single-
>> threaded connections are most efficient, while for more coarse-grained
>> computations concurrency is the best solution.
> s/is/might be/
> In the degenerate case where the dependency graph is strictly linear,
> concurrency might be completely wasted unless you're able to take
> advantage of pipelining.
Isn't execution of linearly dependent operations always a pipeline?
Assuming that, isn't every vertex in a dependency graph a pipeline by
Assuming that, isn't pipelining the entire purpose of any form of
> And there's still a question of how that concurrency is carved up. If
> you have a complicated dependency graph and you can run several of these
> flow graphs in parallel, but each in a single thread (as described
> below), that seems likely to induce less synchronization than if you try
> to run the nodes of a single flow graph in parallel. Of course whether
> that decomposition can work depends on your problem.
You have a point. But: suppose that you have two operations, A and B,
and you want to output B(A(item)) for each item out of a collection of
inputs. The alternatives that you're suggesting are
(1) have operation A executed by one thread and operation B by the
other, then pipeline these threads;
(2) have two (or more) identical threads, each taking an equal share
of the inputs and executing B o A.
In scenario (2) the threads don't need to communicate, so you're right
that it will involve less waiting if you generalize this to more
complex computations, but now we're not taking into account that (1)
is only a member out of a much larger class of solutions to the same
problem. If we realistically assume that one operation takes longer
than the other, let's say that B takes three times the amount of time
of A, then the following network will prove superior to (1) and (2) on
the condition that you have a CPU with at least four hardware threads:
(1b) We use four execution threads (or any multiple). The first takes
the inputs, applies operation A on them and divides the intermediate
results between the other three threads. The other three threads apply
operation B, each to their own share of the intermediate results.
This also generalizes to more complex dependency graphs. In fact, the
more operations are involved, the more likely you are to find
significantly different execution times.
>> [...] I find it very hard to imagine how this could be
>> done, because at compile time it's impossible to predict how the
>> instruction streams of both threads will be aligned during run-time.
>> If something like that could actually be done, *please* enlighten me.
> Well, I have to admit, that sounds difficult. But you don't necessarily
> need more than a single-value buffer for each sub-computation, do you?
Indeed you don't, in fact I think you almost never do. Rather, the
multiple-value buffer is mostly an optimization that enables
asynchronous operation. The sender can push another value to the queue
even if the receiver didn't pop the previous one yet, and vice versa.
On top of that, every thread only needs to wait for its own
preemptions, not for those of the other threads (unless preemption
takes longer than filling/emptying the buffer).
The problem with a single-value buffer, whether static or dynamic, is
really that it can't be asynchronous; production and consumption of a
message must always take turns. I believe this largely defeats the
purpose of concurrency. Regardless, if you are to implement concurrent
message passing with a static single-value buffer, you still have the
problem that you'll somehow have to align the instruction streams of
both threads. Now this seems less remote to me than the same thing
with a multiple-value buffer, but I imagine that the act of aligning
the instruction streams would add considerable overhead at run-time.
After all, wouldn't that just be a barrier?
>> [...] You appear
>> to believe that the same library should provide for both ways of
>> composing computations, i.e. the static single-threaded one (like
>> Accumulators) and the dynamic asynchronous one. I doubt that would be
>> the right thing to do, because the modes of composition are not
>> related in their underlying logic (contrary to the static and dynamic
>> expressions in Xpressive) and because they pose very different
>> requirements on the subcomputations that are to be composed.
> You may be right.
>> [...] As the
>> Accumulators mode of composition involves much tighter coupling
>> between the operations, they need to have much more in common, e.g.
>> features and dependencies in Accumulators.
> Hmm, I don't think I buy that last part of the argument, but I also
> don't think we're very far apart now.
It might not be important enough to elaborate on. I agree that we're
probably not far apart.