Subject: Re: [boost] [context/coroutine] split into two libs in trunk?!
From: Giovanni Piero Deretta (gpderetta_at_[hidden])
Date: 2012-04-13 18:37:32
On Fri, Apr 13, 2012 at 7:39 PM, Oliver Kowalke <oliver.kowalke_at_[hidden]> wrote:
> Am 13.04.2012 19:21, schrieb Mathias Gaunard:
>> This is incorrect. ucontext is just one of the provided implementations.
>> There is also custom assembly for x86.
> what about the other architectures?
> and has to save/restore the registers as the call conventions require (and
> fcontext does).
> Why should it then be faster?
> At a brief look it preserves neither the SSE2 control and status word
> nor the x87 control word.
> If you do not take care of the calling convention and fail to preserve
> some relevant data, of course you can be faster (but it is incorrect code and
> might fail).
Not saving the SSE and x87 control words was a conscious decision on my
part. The control words are unlike other callee/caller-saved registers,
as they define a process mode and are explicitly under the control of
the user. In my tests, the instructions used to load/save these states
had a considerable cost on my old NetBurst CPU.
The compiler may temporarily change the control state (for example, in
legacy x87 mode to implement some non-standard rounding), but it has
to reset it to the original value before calling any externally
defined function (like the ASM context-switching functions), as these
will expect the control words to be in the default state (whatever that
default is).
The only time called functions will see the control words in a
non-default state is if the user explicitly changed the state, for example
via a C99 compiler pragma or builtin function. The boost.coroutine
documentation explicitly warned about risky changes to process state
across coroutine calls, including the signal mask (which boost.context
also does not preserve), locks, TLS and of course the FPU state.
So, yes, the user might see failures, but never because of hidden
optimizations done by the compiler, only because he forgot
to restore the shared state (any state) to a sane default before
switching out of the coroutine.
Having said that, I doubt that on a modern CPU this extra state
save/restore would cost more than an extra 50% on a context call
(which in the grand scheme of things isn't really that much). Any
claimed scalability differences between boost.context and my old
library must come from somewhere else, not from the low-level
context-switching routines. The only thing that comes to mind is
that boost.coroutine saved all registers on the stack (which is
very likely to be cache-hot) instead of in a separate structure as
boost.context does (which, IIRC, was heap-allocated in the higher-level
API).
FWIW, while it is hard to compare my results on an old 32-bit machine
with yours on an undoubtedly newer CPU and OS, I distinctly remember
from my tests that a coroutine-to-coroutine switch (using the high-level
API) was about 100 times faster using the custom backend than
using ucontext (mainly because of the high cost of the function call).
Boost list run by bdawes at acm.org, gregod at cs.rpi.edu, cpdaniel at pacbell.net, john at johnmaddock.co.uk