From: David Abrahams (dave_at_[hidden])
Date: 2007-10-27 13:42:55


Introduction
============

I've installed Bitten (http://bitten.edgewall.org/) on my Trac server
and would like to use it for testing Boost (http://boost.org).
However, Boost has testing needs that I think may not be easily served
by Bitten in its current form -- at least I'm not seeing easy
solutions. Christopher Lenz, the author of Bitten, suggested that I
post about them on bitten_at_[hidden], thus this message, which
is also cross-posted to the boost-build and boost-testing lists.

Quick Overview of Bitten (for the Boost people)
===============================================

Bitten is a continuous integration system that's integrated with Trac.

Testing machines (called "slaves") run a Python script (part of the
Bitten distribution) that checks periodically with the server to see
if anything needs to be tested. The server manages a set of testing
"recipes," each of which is associated with a directory in the
subversion repository and a set of slave criteria (such as "are you
running Linux?"). If a recipe has criteria matching the inquiring
slave and there have been changes in the associated part of the
repository, the slave executes the recipe.

A recipe is divided into a series of "steps," each of which is
composed of commands
(http://bitten.edgewall.org/wiki/Documentation/commands.html). Slaves
report back to the server as they execute each step, so the server can
tell you about the progress of each slave. If any step fails, the
build fails and the slave stops. Recipes are expressed in XML and
stored in the server's Trac database. Trac users with the right
privileges can create, edit, and delete recipes.
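
To give the Boost readers a concrete picture, here is roughly what a
minimal per-library recipe might look like, generated by a short
Python sketch. This is only illustrative: the sh tool namespace URI
and the sh:exec attributes are written from memory of the Bitten
documentation linked above and should be checked against it, and the
library name and bjam arguments are placeholders.

# Illustrative only: emits a hypothetical per-library recipe that
# runs a whole library's tests in a single step.  The <build>/<step>
# structure follows the Bitten documentation linked above, but the sh
# tool namespace URI and the sh:exec attributes are from memory and
# should be checked; the library name and bjam arguments are
# placeholders.
RECIPE_TEMPLATE = '''\
<build xmlns:sh="http://bitten.cmlenz.net/tools/sh">
  <step id="test-%(library)s" description="Run the %(library)s tests">
    <sh:exec executable="bjam" args="libs/%(library)s/test"/>
  </step>
</build>
'''

def make_recipe(library):
    """Return recipe XML that tests one library in a single step."""
    return RECIPE_TEMPLATE % {'library': library}

if __name__ == '__main__':
    print(make_recipe('python'))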

Quick Overview of Boost Testing (for the Bitten people)
=======================================================

Boost is a collection of C++ libraries. Each library is maintained by
at most a handful of individuals. Some libraries depend on other
libraries.

Boost uses its own build system, Boost.Build, which allows developers
to express build and test targets without special knowledge of the
platform (OS and specific compiler toolsets) that might be used to
create those targets. That's important, since most developers don't
have direct access to all the platforms they might want to support.

Each library author writes a suite of tests using Boost.Build. Anyone
who wants to test his library locally can run Boost.Build in his
library's test/ directory and see if any test targets fail. That's
important too -- we don't want people to have to install a Bitten
slave in order to do local testing.

Some Boost.Build test targets only pass if others fail to build.
Boost uses that paradigm to make sure that constructs prohibited by a
library at compile-time actually fail to compile. This kind of
expected failure is encoded in the build specification just like the
creation of an executable or library.

A library may be unsupported or only partially supported on a given
platform, usually because of bugs or missing features in that
platform, so our test results feedback system post-processes the
Boost.Build results to decide how to represent failures that may
already be known. This
kind of expected failure is represented in
http://boost.org/status/explicit-failures-markup.xml, which allows us
to say whether any given library or test is expected to work on any
given platform, and if not, the possible reasons for the failure.

Our results feedback system (see
http://beta.boost.org/development/tests/trunk/developer/summary_release.html
for example) shows reports for each library, describing its status in
two ways: for developers
(http://beta.boost.org/development/tests/trunk/developer/summary_release_.html#legend)
and for users
(http://beta.boost.org/development/tests/trunk/user/summary_release_.html#legend).
Of course there's a detailed results view for each library as well
(http://beta.boost.org/development/tests/trunk/user/parameter_release.html).

Boost's web UI for results feedback leaves a lot to be desired, but
also has many unique features of real value to Boost. See
http://lists.boost.org/Archives/boost/2007/08/125724.php for a
summary.

Issues with Using Bitten for Boost Testing
==========================================

* Where do recipes come from? Will library writers author them? If
  so, won't that be redundant with information that's already captured
  in Boost.Build declarations? If they're automatically generated
  (say, by Boost.Build), how will they get into the Trac database?

* What is the granularity of recipes and steps? Probably the ideal is
  that there's a recipe for each library. Testers with a particular
  interest in some library could configure their slaves to only match
  those recipes. That would also keep testing from stopping
  altogether just because one library failed.

  We'd probably also like to have one step for each of a library's
  tests, so that we can get detailed feedback from Bitten about what
  has failed and where.

  As I understand it, however, a slave stops executing a recipe as
  soon as one of its steps fails, but we probably don't want Bitten to
  stop testing a whole library just because one of its tests fails (we
  might want to stop if ten tests fail, though -- right now we just
  keep on going).
  
  Also, using a step for each test means invoking the Boost.Build
  engine (bjam) separately for each test, which would be inefficient;
  you'd pay the startup cost for BB at each step. So unless Bitten is
  changed, we probably have to test a whole library in one step.

* What about parallelization? It seems to me that if we break up the
  tests with one recipe and one step per library, we can fairly easily
  parallelize testing of a single library, but testing multiple
  libraries in parallel could lead to race conditions if any two
  libraries depend on a third one: bjam would get launched once for
  each dependent library, and both processes would try to build the
  dependency at once.

  One obvious approach to parallelization involves doing something we
  should do anyway: testing against the libraries as built and
  installed. If installation were a separate recipe then we could run
  tests in parallel to our hearts' content. The problem is that AFAIK
  there's no way to express a dependency graph among build recipes in
  Bitten. (The first sketch after this list shows the kind of
  dependency-aware ordering we would need.)

* A library should only be considered "broken" on a given platform if
  a failing test is not marked as an expected failure, e.g. one due
  to platform bugs (or perhaps the test was passing in the previous
  release, indicating a regression -- but in that case it probably
  should not be marked as an expected failure). That seems to imply the
  ability to postprocess the testing results against
  http://boost.org/status/explicit-failures-markup.xml before deciding
  that a test has failed. Today we do that processing on the server,
  but I think it would be reasonable to rewrite it (in Python) to
  happen on the slave; the second sketch after this list gives a
  rough idea of that check. That would probably make the system more
  responsive overall. It would also be reasonable to break
  http://boost.org/status/explicit-failures-markup.xml into separate
  files for each library so one library's test postprocessing doesn't
  have to see the markup for all the other libraries.
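
Two rough sketches follow to make the references above concrete.

First, the parallelization issue: a minimal sketch of the kind of
dependency-aware ordering of per-library recipes we would need,
whether Bitten itself grew support for it or a wrapper on the slave
did the ordering. Everything here is hypothetical, and the dependency
map in particular is made up; in practice it would have to be derived
from Boost.Build metadata.

# Minimal sketch, not Bitten code: groups libraries into "waves" that
# could be tested in parallel, given a dependency map.  Each wave
# contains only libraries whose dependencies were handled in an
# earlier wave, so two parallel test runs never race to build the
# same dependency.  The dependency data is made up.
DEPENDS_ON = {
    'type_traits': [],
    'bind':        ['type_traits'],
    'python':      ['bind', 'type_traits'],
    'regex':       [],
}

def schedule_waves(depends_on):
    done = set()
    remaining = set(depends_on)
    waves = []
    while remaining:
        ready = sorted(lib for lib in remaining
                       if all(dep in done for dep in depends_on[lib]))
        if not ready:
            raise ValueError('dependency cycle among %s' % sorted(remaining))
        waves.append(ready)
        done.update(ready)
        remaining.difference_update(ready)
    return waves

if __name__ == '__main__':
    for number, wave in enumerate(schedule_waves(DEPENDS_ON)):
        print('wave %d: %s' % (number + 1, ', '.join(wave)))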
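
Second, the expected-failure issue: a sketch of what the slave-side
check against explicit-failures-markup.xml could look like. The
element and attribute names (library, mark-expected-failures, test,
toolset) reflect my reading of that file and should be verified
against it; matching toolset names as glob patterns is also an
assumption, and the names in the usage comment are placeholders.

# Rough sketch only: decide whether a failing test is an expected
# failure according to explicit-failures-markup.xml.  Element and
# attribute names are assumptions to be checked against the real
# file; toolset names are matched as glob patterns (e.g.
# "borland-*"), which is also an assumption.
import fnmatch
import xml.etree.ElementTree as ET

def load_expected_failures(markup_path, library):
    """Return (test_name, toolset_pattern) pairs marked for `library`."""
    expected = []
    root = ET.parse(markup_path).getroot()
    for lib in root.findall('library'):
        if lib.get('name') != library:
            continue
        for mark in lib.findall('mark-expected-failures'):
            tests = [t.get('name') for t in mark.findall('test')]
            toolsets = [t.get('name') for t in mark.findall('toolset')]
            for test in tests:
                for toolset in toolsets:
                    expected.append((test, toolset))
    return expected

def is_expected_failure(expected, test_name, toolset):
    return any(test_name == name and fnmatch.fnmatch(toolset, pattern)
               for name, pattern in expected)

# Hypothetical use on a slave, after running one library's tests:
#   expected = load_expected_failures('explicit-failures-markup.xml',
#                                     'python')
#   real_failures = [t for t in failed_tests
#                    if not is_expected_failure(expected, t, 'gcc-4.2')]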

I'd be especially interested in hearing from Bitten folks if there are
ways of using Bitten that I've overlooked, or if they see particular
low-hanging features that would make testing Boost with Bitten more
practical.

Thanks for reading,

-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com
