Subject: Re: [boost] [msm] scalability
From: Christophe Henry (christophe.j.henry_at_[hidden])
Date: 2010-02-08 16:41:15
David Abrahams wrote:
> The idea would be to do all parts that are hard on the compiler
> at runtime. I don't see why you should have to sacrifice features.
I promised to analyze which feature costs how much compile time and
to test your point. I hope to offer the beginning of an answer, and
I hope you will enjoy the trip to the VC++ world ;-)
I took the iPod example from the last BoostCon and compiled it with VC8
and VC9 in Debug and Release. I think it is an interesting example:
submachines, orthogonal regions, a good number of states. I also took
the MPLLimitTest example (80 transitions, 80 states) to check a wide
state machine not factored into any submachines. Both can be found in
the sandbox in libs/msm/doc (/iPod or directly for the MPLLimitTest).
So, the results:
VC9 (Release): 60s
VC8(Debug): 183s (!!!)
Notice how much longer Debug takes, as Jeff rightly pointed out.
Part1: we now move the submachines into different headers, simply
included in the main application. Should change nothing, right? First
small surprise, it helps VC8 in Debug:
VC9 (Release): 65s
I didn't see why Debug took so much longer, so I played a bit with the
compiler options until I found the culprit. If we now deactivate the
/Gm option ("Enable minimal rebuild") in Debug (not activated in
Release), we now see a big difference:
Wow! 20s better for VC9, 102s for VC8. This could explain Jeff's
problems. This is what I call a successful optimization from the
compiler to create a minimal rebuild. Now Debug builds are faster than
Release (due to shorter link time, in case you wonder).
This is nice, but we have not yet added any new policy for faster
compiling. I now add one. Unfortunately, reducing the number of template
instantiations proved to be hard, as most of the removed
instantiations which were "hard on the compiler" were still done
somewhere else, and VC is clever enough not to repeat itself.
On the MPLLimitTest it gives:
VC9(Debug): 135s (without /Gm. Otherwise count 275!)
VC8(Release): crash :( (ok this is really a hard test)
VC8(Debug): I stopped after the compiler used up 3.5GB of my 4GB...
VC8(Release): crash :(
OK, better. It can also be better for the iPod example, as the policy
allows you to move some of the metaprogramming for submachines to
other TUs and compile with 3 cores (one fsm, 2 submachines) using /MP
(available on VC8 and VC9).
See in the doc the example in iPod/Part2. We now have the following
compile time (using 3 cores):
This looks much better as it allows the clever user to build
submachines and enjoy better compile times. For cases where
performance is needed, simply omit the submachines' cpp files from
The trick works using a boost::any to contain the event, thus allowing
you to avoid instantiating a process_event inside the main machine
The only feature you lose with this policy is the new possibility to
make a derived event fire its base event's transition (any can
recognize only exact types).
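
To illustrate the exact-type limitation, here is a minimal sketch. It is
not the msm implementation: it uses std::any (C++17) as a stand-in for
boost::any, and the event types and the function submachine_process are
hypothetical names invented for this example.

```cpp
#include <any>
#include <cassert>
#include <string>

// Hypothetical event types, for illustration only (not msm names).
struct play {};
struct stop {};
struct fast_play : play {};  // derived event

// Non-template entry point. The caller only needs this declaration, so it
// does not instantiate any per-event template code for the submachine.
std::string submachine_process(const std::any& event);

// In a real build this definition would sit in the submachine's own .cpp.
std::string submachine_process(const std::any& event)
{
    assert(event.has_value());  // an empty any carries no event
    // The pointer form of any_cast returns nullptr unless the stored type
    // is exactly the requested one; base classes are not considered.
    if (std::any_cast<play>(&event)) return "play";
    if (std::any_cast<stop>(&event)) return "stop";
    return "no transition";
}
```

Calling submachine_process(std::any(fast_play{})) yields "no transition"
even though fast_play derives from play, because the type-erased
container only remembers the exact stored type.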
So this is the current state. But it can be better. We still pay for
the instantiation of submachines into the main machine, which still
costs you time. I made a test using a proxy for submachines. Since the
proxy contains only a void* to the real submachine, instantiating it is
cheap. It's not finished, but we now build on 5 cores (1 for the main
fsm, plus 2 for each of the 2 submachines). Yes, we can now build a
submachine on 2 cores :)
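
The proxy idea can be sketched roughly as follows. This is only an
assumption about the general shape, not the code from the test: the names
submachine_proxy and real_submachine are hypothetical, and the event is
reduced to an int to keep the example short.

```cpp
#include <cassert>

// Hypothetical proxy: this is all the main fsm would need to see.
// It holds an untyped pointer, so including it pulls in no templates.
class submachine_proxy {
public:
    submachine_proxy();
    ~submachine_proxy();
    submachine_proxy(const submachine_proxy&) = delete;
    submachine_proxy& operator=(const submachine_proxy&) = delete;

    // Forwards an event; returns the number of events handled so far.
    int process(int event_id);

private:
    void* impl_;  // points to the real submachine, type known only below
};

// ---- everything below would be in the submachine's own .cpp ----
namespace {
// Stand-in for the expensive, heavily templated submachine.
struct real_submachine {
    int handled = 0;
    int process(int /*event_id*/) { return ++handled; }
};
}  // namespace

submachine_proxy::submachine_proxy() : impl_(new real_submachine) {}

submachine_proxy::~submachine_proxy()
{
    delete static_cast<real_submachine*>(impl_);
}

int submachine_proxy::process(int event_id)
{
    assert(impl_ != nullptr);
    return static_cast<real_submachine*>(impl_)->process(event_id);
}
```

With this layout, only the translation unit that defines the proxy's
member functions pays for the real submachine, so the translation units
can be compiled in parallel.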
This gives us the possible future compile-time:
I hope this will make the compile-time problem more manageable. Again,
more to come on this later.
Jeff, if you could give it a try, it would be really great!
PS: I am using a Q6600, 4 cores, 2.4GHz, no hyperthreading. We can
soon use the new hexacore with hyperthreading :)