Subject: Re: [boost] [xpressive] Performance Tuning?
From: Edward Grace (ej.grace_at_[hidden])
Date: 2009-07-28 18:23:46


Paul,

[Excuse the following - perhaps it's an attack of logorrhea]

> I also am encouraged by what Edward's timer might add though I'm a
> little wary of possible inclusion of code following FFTW's license as
> it may be incompatible with Boost.

I'm pretty sure the FFTW license is Boost compatible (judging by the
comments in the header and the BSD-like adoption of FFTW in many
commercial and open source projects); however, my aim is a robust
performance framework that is chronometer agnostic - there's no need
to use FFTW's timer.

The getticks() macro just happens to be a handy cross-platform
implementation!
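
To give a feel for what "chronometer agnostic" could mean in practice,
here is a minimal sketch (illustrative only - ChronoClock and
measure() are made-up names, not part of the framework) in which the
measurement loop is templated on a clock policy, so getticks(),
std::chrono or a raw cycle counter can all sit behind the same
interface:

#include <chrono>
#include <cstddef>
#include <vector>

// One possible clock policy: wraps std::chrono::steady_clock.
struct ChronoClock {
    typedef std::chrono::steady_clock::time_point tick_type;
    static tick_type now() { return std::chrono::steady_clock::now(); }
    static double elapsed(tick_type a, tick_type b) {
        return std::chrono::duration<double>(b - a).count();
    }
};

// Collect raw timings of f(); any type providing now()/elapsed() will
// do, e.g. a policy built around FFTW's getticks().
template <typename Clock, typename F>
std::vector<double> measure(F f, std::size_t repeats) {
    std::vector<double> samples;
    samples.reserve(repeats);
    for (std::size_t i = 0; i != repeats; ++i) {
        typename Clock::tick_type t0 = Clock::now();
        f();
        typename Clock::tick_type t1 = Clock::now();
        samples.push_back(Clock::elapsed(t0, t1));
    }
    return samples;
}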

There are other concerns that I need to address regarding the
licensing and provenance of this code. For now I am simply keen to
see whether it works for other people and in other real-world
situations - hence the application to the boost::spirit / xpressive
debate.

After all, you can read books on nonparametric statistical inference
until you are paralysed into inaction - the only real way to know
whether it works is to try it, get other people involved and apply it
in real-world situations.

From that I'm sure I will learn a lot and then be in a position to
weed out unnecessary cruft (e.g. bootstrap resampling) and add useful
stuff (potentially the cache-related code mentioned earlier). After
that it's a case of reviewing everything that's left after I've done
my weeding and making sure it's appropriate.

> Edward, I do wonder about the statistical significance metrics you
> provide (great idea by the way).
> I'm wondering if your code assumes a normal timing duration
> distribution and if so, can it 'measure' whether the distribution of
> results conform to this assumption.

It makes few assumptions about any underlying probability density
function (not even that it's strongly stationary) since it's based on
nonparametric techniques.

Key to this is that, while the underlying distribution may not be
normal or even known, the distribution of certain statistics
estimated from a sample *is* well known.

For example, the median is asymptotically normal (with certain
constraints on the PDF). In other words if you take a large sample
and calculate the median - then repeat this ad nauseam, the resulting
medians will be approximately normally distributed.

As a concrete example, this is put to use in the calculation of the
confidence bounds - via the Wilcoxon signed-rank test, for which the
W+ statistic has a well-known distribution. We can therefore
determine whether one function is faster than another in a (hopefully)
statistically significant way.
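
To make that concrete, here is a minimal sketch of the W+ statistic
and its large-sample z-score, under the assumption of paired timings
of the two functions over the same workload (illustrative code, not
the framework itself):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

double wilcoxon_w_plus(const std::vector<double>& a,
                       const std::vector<double>& b,
                       double& z_out) {
    std::vector<double> d;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        if (a[i] != b[i]) d.push_back(a[i] - b[i]); // drop zero differences

    const std::size_t n = d.size();
    std::vector<std::size_t> idx(n);
    for (std::size_t i = 0; i < n; ++i) idx[i] = i;
    std::sort(idx.begin(), idx.end(),
              [&](std::size_t i, std::size_t j) {
                  return std::fabs(d[i]) < std::fabs(d[j]);
              });

    std::vector<double> rank(n);
    for (std::size_t i = 0; i < n; ) {              // average ranks over ties
        std::size_t j = i;
        while (j + 1 < n &&
               std::fabs(d[idx[j + 1]]) == std::fabs(d[idx[i]])) ++j;
        const double r = 0.5 * (i + j) + 1.0;       // mean of ranks i+1..j+1
        for (std::size_t k = i; k <= j; ++k) rank[idx[k]] = r;
        i = j + 1;
    }

    double w_plus = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        if (d[i] > 0.0) w_plus += rank[i];

    // Large-sample normal approximation for W+ under the null hypothesis
    // of no systematic difference between the two functions.
    const double mean = n * (n + 1) / 4.0;
    const double var  = n * (n + 1) * (2.0 * n + 1) / 24.0;
    z_out = (w_plus - mean) / std::sqrt(var);
    return w_plus;
}

A large positive z suggests the first function's timings
systematically exceed the second's (i.e. the second is the faster
one); a z near zero means no significant difference has been shown.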

> In my experience, timing benchmarks can suffer from having outliers

I don't view these as 'outliers' in the usual sense of 'measurement
error' - they are measurements that are just as valid as any other -
one just has to ask "what is this measuring - the function under test
or the OS?". Often it's the latter, but there is (as I see it) no
clear borderline for classifying any given measurement as one and not
the other. Consequently I view 'throwing away' (trimming)
measurements with great suspicion.

> (usually OS-induced by not having a RTOS)

RTOS? Real-time OS?? Do these eliminate this effect?

> that make the distribution less than 'normal'.

The distribution would not be normal even in principle. It must have
a well-defined minimum, but (in principle) has no upper bound - the
(gross) lack of symmetry rules out normality in a heartbeat!

> Would it be possible to establish that the given sample set of timings
> correspond to a normal distribution (or perhaps discard a certain
> percentage of outliers if necessary).

It is possible to determine whether the sample corresponds to a known
distribution - for example with the Kolmogorov-Smirnov test. However,
which distribution should we check against? There is no fundamental
requirement for the measurements to belong to any family - even
vaguely. In fact the distribution may not even be smooth!
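
For what it's worth, the statistic itself is easy to compute - here's
a rough sketch, with reference_cdf standing in for whichever family
one chose to test against (which is precisely the part that is
unclear for timing data):

#include <algorithm>
#include <cmath>
#include <functional>
#include <vector>

// One-sample Kolmogorov-Smirnov statistic: the largest gap between the
// empirical CDF of the samples and a hypothesised reference CDF.
double ks_statistic(std::vector<double> x,
                    const std::function<double(double)>& reference_cdf) {
    std::sort(x.begin(), x.end());
    const double n = static_cast<double>(x.size());
    double d = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double f = reference_cdf(x[i]);
        // The empirical CDF jumps from i/n to (i+1)/n at x[i];
        // check the gap on both sides of the jump.
        d = std::max(d, std::max(std::fabs((i + 1) / n - f),
                                 std::fabs(f - i / n)));
    }
    return d;   // compare against critical values for the chosen alpha
}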

> I'm no statistics person, but I have seen cases where 10,000 timing
> samples have been biased by 10 samples that probably relate to a
> program issue or a severe OS issue. e.g. normally 1 ms per
> iteration, 10 @ 1-10s each

Yes - that issue has a dramatic impact on the robustness (or lack
thereof) of the mean.

Often, of course, these things aren't even vaguely IID (independent
and identically distributed) - burst-like behaviour can lead to
strongly non-stationary behaviour. For example some garbage collector
might go nuts and skew your timing. That's why, in my view, it's so
important to compare relative timings and to obtain confidence bounds.

Pragmatically one might argue that when you're doing these things at
the console during development and making decisions, it's not really a
major issue. One can look at the numbers, say "hmm, fishy - I don't
trust that", and tinker around until the result is in line with one's
expectation. That's part of the art of experimental science, and many
of us do it every day without even realising.

How to codify that kind of judgement into something autonomous -
capable of, say, selecting the 'correct' function in a high-load,
multi-user environment in the real world - is far from clear, however.

That is my goal.

-ed
------------------------------------------------
"No more boom and bust." -- Dr. J. G. Brown, 1997

