Boost Testing :
From: Martin Wille (mw8329_at_[hidden])
Date: 2005-06-14 21:56:46
Doug Gregor wrote:
> Regression testing is most effective when we can immediately see the
> results of changes to the repository. We know that "immediately" isn't
> really possible, because Boost takes a long time to build & test and we
> have limited resources. However, I'd like to shoot for one-day
> turnaround on each of our primary platforms if possible.
> To do this, I think we need to try to balance the load a bit. The
> MetaComm and Martin Wille tests cover a huge number of compiler
> variants, but they also have 2-day turnaround times which makes it
> harder to find and fix errors. Let's communicate amongst ourselves to
> divide up the set of tests that each tester runs and try to get in the
> daily results by, say, 10am EST.
The running time depends heavily on which changes have been made (if you
run the tests incrementally).
A clean run (or a run after something changed in, say, Boost.Config)
takes 1.5 days here (or even longer if something goes wrong). If there
are no changes then a run takes only three hours.
Since the running time isn't predictable, I don't think we can set up the
test runs such that the results are available at a specified time of day.
The toolsets which take most of the time are intel-8.1 (because the
compiler is slow) and gcc-2.95 (because >50% of the tests get recompiled
every run). My plan currently is to drop support for gcc-2.95 (and its
STLport variant) after release 1.33. I'm also considering dropping gcc
3.2.3. It isn't considered release relevant for this release, and it
would hardly be for the next.
Since you made gcc-3.3.6 and gcc 3.4.4 release relevant instead of gcc
3.3.5 and 3.4.3, I can (and will) drop the latter two toolsets.
> For instance, our Linux box (OSL2) is now running gcc-3.3.6-linux and
> gcc-3.4.4-linux nightly, submitting around :00am EST each day, so we
> could speed up the Martin Wille tests a bit by dropping those toolsets.
> We could also add one other toolset to OSL2, but only if it doesn't
> push back the submission time too far. If it does, we'll bring in
> another Linux box (OSL3) to do the testing.
> If need be, we could drop testing of minor variants (gcc-3.3.5-linux
> vs. gcc-3.3.6-linux) to get better throughput.
AFAICS, the differences between 3.3.5 and 3.3.6 are quite small. I kept
3.3.5 and 3.4.3 only because they were marked relevant for the release a
few days ago. I'm going to drop them.
If you have spare resources then you could run the Intel tests, provided
Intel supports that compiler on the Linux distribution you use.
There's no hassle involved, license-wise, in installing the Intel
compiler for testing Boost. Intel doesn't support the distribution I use
and making Intel's install script work involves manual work for every
update and it is quite a bit of a hassle (e.g. it involves installing a
fake RPM database). If you could run the Intel tests instead of me then
this would make my life easier and improve my testing throughput.
> My point is simple: More testing is good, but predictable, up-to-date
> results are better.
I fully agree. However, I'd like to add: we need redundancy, too.
As I wrote in a message some time ago, there are more factors
that have an influence on the test results than just the toolset. In
case of Linux, I listed: kernel version, threading implementation, CPU,
Python version, libc version (and there are even more). Redundancy helps
us to detect problems related to these more subtle factors and it also
helps in case a test runner is unavailable for some time. Ironically, it
also helps improve the turnaround time for the library maintainers:
if several people run tests for the same toolset then new results will
get uploaded more frequently.
I'd like to remind you of the suggestion I made some time ago: let's
have two result sets for each runner; one "committed" result set which
won't get changed anymore and one "building" result set. The "building"
set would get updated after every single run for a toolset. Once all
toolsets are run, the "building" set becomes "committed" and a new empty
"building" set is created. This improves turnaround times by allowing
for intermediate results to get displayed much earlier than the complete
set. If something breaks badly then the problem becomes apparent quite
quickly, can get fixed quickly, and the "building" set can be reset
before tests have been run for all toolsets.
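
To make the two-result-set idea concrete, here is a minimal sketch of the bookkeeping for one runner. It is an illustration only: the names RunnerResults and ToolsetResult, and the pass/fail counters, are assumptions of mine and not part of the existing regression tools.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical per-toolset summary; the real result format is not
// specified here.
struct ToolsetResult {
    int passed;
    int failed;
};

// Hypothetical holder of the two result sets kept for one test runner.
class RunnerResults {
public:
    typedef std::map<std::string, ToolsetResult> ResultSet;

    explicit RunnerResults(std::size_t toolset_count)
        : toolset_count_(toolset_count) {}

    // Record the outcome of one toolset run. The result is visible in the
    // "building" set right away. Once every toolset has reported, the
    // "building" set becomes the "committed" set and a new, empty
    // "building" set is started.
    void report(const std::string& toolset, const ToolsetResult& result) {
        building_[toolset] = result;
        if (building_.size() == toolset_count_) {
            committed_ = std::move(building_);
            building_.clear();
        }
    }

    // Throw away the intermediate results, e.g. after a badly broken
    // check-in has been fixed, without touching the committed set.
    void reset_building() { building_.clear(); }

    const ResultSet& committed() const { return committed_; }
    const ResultSet& building() const { return building_; }

private:
    std::size_t toolset_count_;
    ResultSet committed_;
    ResultSet building_;
};
```

A report page would display building() next to committed(), so a maintainer sees the first toolset's results without waiting for the complete run.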
Of course, it also helps not to attempt to run tests for unsupported
toolsets. Robert already added checks to Boost.Serialization in order to
avoid useless, known-to-fail attempts (thanks!) by checking the compiler
versions. However, I tend to think this approach isn't optimal.
Hardcoding this information into the jamfiles adds a lot of redundancy
(e.g. a lot of the Boost.Config stuff would effectively get duplicated).
I see two possible approaches for an improvement:
1. Add the "unsupported" information to the tests themselves, e.g. by
making them print "unsupported" (we could even add information about
what is unsupported: "unsupported(compiler)", "unsupported(bzlib)").
This would spare us some markup and the information provided would be
more detailed than what the manual markup currently gives us (e.g.
Spirit is marked unusable for gcc-2.95. Some parts, though, would work
on that compiler.)
2. Add another step to the build procedure. That step would make the
information from Boost.Config available to bjam. This could be done by
writing a C++ program which writes a jamfile which gets included later.
This would enable a library author to turn the tests for, say, wide
character sets off when they aren't well supported by the environment.