Subject: Re: [Boost-build] Parallel builds
From: K. Noel Belcourt (kbelco_at_[hidden])
Date: 2010-11-19 16:23:55
On Nov 18, 2010, at 5:10 PM, K. Noel Belcourt wrote:
> On Nov 18, 2010, at 4:28 PM, K. Noel Belcourt wrote:
>> Generally I'm quite pleased with the parallel build throughput of
>> bjam but lately I've noticed some latencies that I'd like to fix up
>> (eliminate). The problem is on larger SMP machines with lengthy
>> compile times (slower compilers). What happens is N compile jobs get
>> started and most complete quickly though perhaps one or a few
>> outliers run a long time or until they time out and are killed.
Found it. When select() returns with data on one or more file
descriptors, we use fread() to read data from the live descriptors.
The problem is that these are blocking file descriptors, so most of
the time that bjam spends waiting is actually spent inside fread()
rather than in select().
select() guarantees that the first call to fread() will return data,
but subsequent calls may not. Because we read in a loop, once the
descriptor has been drained of data the next call to read data may
block (unless the process has closed the descriptor, or an error or
signal has occurred).
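
To illustrate the readiness guarantee described above, here is a minimal sketch in C. It is not the bjam source: wait_readable() and read_once() are hypothetical helper names, and the sketch uses read() on a raw descriptor instead of fread() on a FILE* to keep it short. It shows that select() vouches for one read only; once the data has been consumed, the descriptor is no longer reported as readable until more data arrives or the writer closes its end.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable (data or end of file), 0 on timeout, -1 on error
 * (including interruption by a signal). */
static int wait_readable(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    return select(fd + 1, &rfds, NULL, NULL, &tv);
}

/* Hypothetical helper: issue exactly one read(). Called right after
 * wait_readable() returned 1, this does not block, even on a blocking
 * descriptor. Calling it a second time without another select() may block. */
static ssize_t read_once(int fd, char *buf, size_t cap)
{
    return read(fd, buf, cap);
}
```

With a pipe whose write end is still open, wait_readable() returns 1 after a write, and returns 0 (timeout) again once read_once() has consumed the data. A second blocking read at that point would be the stall described here.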
There are two possible patches: call fread() just once each time
select() returns (not inside a loop), or make the file descriptors
non-blocking. Making the descriptors non-blocking is more efficient
because it lets us read all the data on a descriptor each time
select() returns. The first approach would read only as much as fits
into our buffer, not everything available on the descriptor.
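
A minimal sketch of the non-blocking approach, in C, might look like the following. This is not the patch itself: set_nonblocking() and drain_fd() are hypothetical names, and the sketch assumes read() on the raw descriptor rather than fread() on a FILE*. The idea is to set O_NONBLOCK with fcntl() and then read in a loop until the kernel reports EAGAIN, which means "drained for now" and sends us back to select().

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: put fd into non-blocking mode.
 * Returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: read everything currently available on a
 * non-blocking fd into buf, up to cap bytes.
 * Returns the number of bytes stored, or -1 on a real error.
 * *eof is set to 1 if the writer has closed its end, else 0. */
static long drain_fd(int fd, char *buf, size_t cap, int *eof)
{
    size_t total = 0;

    *eof = 0;
    while (total < cap) {
        ssize_t n = read(fd, buf + total, cap - total);
        if (n > 0) {
            total += (size_t)n;      /* got data, try for more */
        } else if (n == 0) {
            *eof = 1;                /* writer closed the descriptor */
            break;
        } else if (errno == EINTR) {
            continue;                /* interrupted by a signal, retry */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            break;                   /* drained for now, go back to select() */
        } else {
            return -1;               /* genuine read error */
        }
    }
    return (long)total;
}
```

A caller would invoke set_nonblocking() once when the descriptor is created and drain_fd() each time select() reports it readable. If buf fills before the descriptor is drained, the remaining data simply shows up on the next select() call.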
I tested this patch on SUSE, Red Hat, and Darwin: I built bjam and
then ran the regression tests to confirm that it can keep N cores
fully loaded.
Okay to commit?
This is the problem I was seeing.
On a 16-core SMP machine, one process generates a warning message and
this same process takes a long time to compile. The warning message
is written to a file descriptor, causing select() to return and
sending us to read the data on this descriptor. Our call to read data
is inside a loop, so we happily read the data on the descriptor until
there is no more. Because the process hasn't terminated there's no
end of file, and because the descriptor is blocking, bjam waits
inside fread() for more data, end of file, or an error, instead of
waiting in the select() call.