Subject: Re: [boost] [Fibers] Performance
From: Hartmut Kaiser (hartmut.kaiser_at_[hidden])
Date: 2014-01-11 15:12:59
> > Do you have some performance data for your fiber implementation? What
> > is the
> > (amortized) overhead introduced for one fiber (i.e. the average time
> > required to create, schedule, execute, and delete one fiber which runs
> > an empty function, when executing a larger number of those, perhaps
> > 500.000 fibers)? It would be interesting to see this number when
> > giving 1..N cores to the scheduler.
> unfortunately I have no performance tests yet - maybe I'll write one after
> some optimizations (like replacing the STL containers with a singly-linked
> list of intrusive_ptr).
I'd write the test before starting to do optimizations.
> I'm not sure what a fiber should execute within such a test. Should the
> fiber-function have an empty body (i.e. execute nothing)? Or should it at
> least yield once?
Well, those are two separate performance tests already :-P
However, having it yield just adds two more context switches and a
scheduling cycle, so I wouldn't expect too much additional insight from it.
While you're at it, I'd suggest also writing a test measuring the overhead
of using futures.
For an idea of what such tests could look like, you might want to glance here:
> if the code executed by the fiber does nothing, then the execution time
> will be determined by the memory allocation algorithm of the C library,
> the context switches for resuming and suspending the fiber, and the time
> required to insert and remove the fiber from the ready-queue inside the
> fiber-scheduler.
Those are assumptions which are by no means conclusive. From our experience
with HPX (https://github.com/STEllAR-GROUP/hpx), the overheads for a fiber
(which is an hpx::thread in our case) are determined by many more factors
than just the memory allocator. Things like contention caused by work
stealing, or NUMA effects such as when you start stealing across NUMA
domains, usually overshadow the memory allocation costs. Additionally, the
quality of the scheduler implementation affects things gravely.
> this queue is currently an STL container and will be replaced by a singly-
> linked list of intrusive_ptrs.
If you had a performance test you'd immediately see whether this improves
your performance. Doing optimizations based on gut feeling is most of the
time not very effective; you need measurements to support your work.
> a context switch (suspending/resuming a coroutine) needs ca. 80 CPU cycles
> on Intel Core2 Q6700 (64bit Linux).
Sure, but this does not tell you how much time is consumed by executing
those switches. The actual execution time will be determined by many
factors, such as caching effects, TLB misses, memory bandwidth limitations,
and others.
IMHO, for this library to be accepted it has to prove to be of high quality,
which implies the best possible performance. You might want to compare the
performance of your library with other existing solutions (for instance TBB,
qthreads, OpenMP, HPX). The link I provided above will give you a set of
trivial tests for those. Moreover, we'd be happy to add an equivalent test
for your library to our repository.
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk