
Subject: Re: [boost] Boost.Plot? - Scalable Vector Graphics
From: Edward Grace (ej.grace_at_[hidden])
Date: 2009-07-30 17:12:55


Again this is a long one but I feel it's worth discussing these
things in some detail --- only by truly understanding the problem can
we try to solve it.

(Something funny is going on regarding my access to this list - I
keep getting things in a strange order and with the wrong subject
line. Anyhow - to add my oar into this discussion.)

>>
>> Well, I normally use a tool like vtune running the code on real data.
>> Do you think such a tool is unreliable?
>>

VTune is a sweet piece of kit, as are other profilers. I don't think
they are unreliable - one just has to understand what they do and
don't do, and their limitations.

For example, consider the 'fast' and 'slow' functions in my code
which nominally differ in run-time by around 0.5%. Do you think the
'fire and forget' operation of a profiler would *reliably* detect
this difference? [This is not a rhetorical question - I don't know
and am genuinely curious]. It appears that my benchmarking does.
Obviously this is a contrived example; no one is likely to care about
0.5% differences.

For example, when you see that function 'abracadabra' takes 14.3323%
of the run-time - how reliable is that number? When you compare it
to the other function that takes 14.5231%, is it actually faster? Are
you sure? If you profile again, do you come to the same conclusion?

From my point of view profilers are of no use here, as they are aimed
at the off-line optimising of code. My original goal, as I have
remarked before (somewhat obliquely), is to select the 'best' function
or parameters dynamically at run-time. In other words, given a set of
functions (algorithms) which nominally do the same thing, I want to be
able to do something like the following (I may be syntactically clumsy
here):

   std::set<unary_function> sf;
   sf.insert(function_a);
   sf.insert(function_b);
   sf.insert(function_c);
   ..
   ..
   ..
   function_pointer f = get_fastest_function(sf, test_argument);
   ..
   ..
   f(argument); // <--- This is now the fastest implementation of whatever we want.

Here, the relative speeds of the different functions could differ by
as much as 100%, and vary with the size of the data, so that it may
not be possible to know a priori which is quickest. Likewise, the code
may be run in a heterogeneous environment, so that knowledge about
which function/parameter combination works best on node "A" is
exactly wrong for node "B".

This may sound like a niche requirement - it is!

> VTune is for tuning application performance. VTune is too big a
> hammer for measuring benchmark runtime. Benchmarks are about wall
> clock time of a piece of code as a metric for performance of
> something else. How you measure that time and how you report those
> measurements is a problem that is prone to error.

Ok, I should have read the whole email before writing the above
monologue -- agreed. Though I'd caution against bare 'wall clock'
times in favour of relative times (sometimes, of course, you may
*have* to measure the wall-clock time).

> Personally I perform several runs of the same benchmark and then
> take the minimum time as the time I report.

Be very cautious about that for a number of reasons.

a) By reporting the minimum, without qualifying your statement, you
are lying. For example, if you say "My code takes 10.112 seconds to
run," people are unlikely ever to observe that time - it is, by
construction, atypical.

b) If ever you do this (use the minimum or maximum) then you are
unwittingly sampling an extreme value distribution:

   http://en.wikipedia.org/wiki/Extreme_value_distribution
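
(To make the 'extreme value' point explicit - this is a standard
result, nothing specific to timing: if the n measurements are
independent with cumulative distribution F(x), then their minimum has
cumulative distribution

   F_min(x) = 1 - (1 - F(x))^n

so the minimum is governed by the far tail of F; its distribution is
skewed and keeps drifting as n grows, whereas the sample median
settles down around the true median.)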

For example - say you 'know' that the measured times are normally
distributed with mean (and median) zero and a variance of 10. If you
take, say, 100 measurements and then take the minimum - how is that
minimum distributed? In other words, if you repeat this - again and
again - what does the histogram of the minimum look like?

What you will find is that the dispersion (variance or otherwise) on
the median is small, say \sigma^2 = 0.15, whereas the dispersion on
the minimum can be much larger, \sigma^2 = 1.8 - *and* the
distribution is highly asymmetrical.

If your two times were nominally separated by around 0.5 you would
not necessarily reliably determine which was fastest from the minima
since the distributions of the minima could overlap.

Attached are two examples: histograms of the observed minimum and of
the observed median over many trials of the above scenario. Notice
that the histogram of the observed minimum is *far* broader than that
of the observed median.


I suggest you experiment with this - you may be (unpleasantly)
surprised; it's somewhat counter-intuitive. An appropriate MATLAB (ok,
ok, put away the pitchforks) snippet is below:

% EV
clear all;
v_median=zeros(50000,1); % Lots of trials to get a decent looking histogram.
v_min=v_median;
for m=1:length(v_median)
     v=randn(100,1).*sqrt(10); % Normally distributed, try other distributions.
     v_median(m)=median(v);
     v_min(m)=min(v);
end
figure(1);
hist(v_median,51);
figure(2);
hist(v_min,51);

> This excludes all outliers due to OS and such.

That may well be the aim, but I'm not sure it does get you what you
want. This is where understanding of the maths (statistics) needs to
be blended with an understanding of the machine. Your use of the
minimum could make it *less* possible to infer something from the
measurements. On the other hand you might be right - I don't know
for sure -- but I am sceptical.

> If a car company reported the average (mean or median) 0 to 60 mph
> for a given car when their test driver was trying to see how fast
> it was everyone would think they were crazy.

That's all down to marketing. I imagine they *do* determine it; they
just don't report it that way. Most people couldn't care less what
these numbers actually mean - just that it's 'better' than the other
one. In the infamous words of "Spinal Tap" - "it goes up to eleven!".

> Why should the fact that he pushed the gas too hard on some of the
> trials and spun out and not hard enough on others count against the
> car?

I imagine they test a number of different cars rather than just the
same car a lot of times and that the reported time is some form of
median. This would mean that half the cars will go a little faster and
half a little slower than this 'typical' car. Almost certainly the
testing criteria and their analysis are more sophisticated than that -
speculation on my part - anyone in the car industry?

> I also typically don't have the luxury of benchmarking on an
> unloaded machine, so I have to make sure to do things fairly, for
> instance, instead of running many trials of A then many trials of B
> and taking the minimum for each, I run many trials of A then B back
> to back and take the minimum. That way A and B on average see the
> same environment over the course of the experiment.

Hopefully. As someone pointed out "It's a dark art"!

> If I run A and B in the same process I run A first then B and then
> B first and then A as a separate run to ensure that the order
> doesn't impact their performance.

Just to add to the mix, I suggest randomising the order in which you
run them (that's what I do); that way you are very unlikely to
accidentally synchronise yourself with some other process, e.g.

ABBBABAABAB etc...
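
A sketch of what I mean, again with invented names and in modern C++
terms - the actual timing of each run is elided:

#include <algorithm>
#include <random>
#include <vector>

enum Trial { A, B };

int main()
{
    // Build an interleaved schedule of trials, then shuffle it so that
    // neither implementation can line up with some periodic background
    // process.
    std::vector<Trial> schedule;
    for (int i = 0; i < 50; ++i) {
        schedule.push_back(A);
        schedule.push_back(B);
    }
    std::mt19937 rng(std::random_device{}());
    std::shuffle(schedule.begin(), schedule.end(), rng);

    for (Trial t : schedule) {
        if (t == A) { /* time one run of implementation A */ }
        else        { /* time one run of implementation B */ }
    }
}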

> Usually benchmarking implies comparison, so the key concern is
> that the results are fair.

Easy to write, very hard to do! ;-)

Regards,

-ed

------------------------------------------------
"No more boom and bust." -- Dr. J. G. Brown, 1997



