
Subject: Re: Streamlining benchmarking process
From: Mateusz Loskot (mateusz_at_[hidden])
Date: 2019-05-09 21:11:05


Hi,

I agree with most of what Stefan wrote, so I will respond to his post rather than
replying to Olzhas' and Stefan's posts separately and repeating what Stefan has already said.

On 19-05-09 08:45:41, stefan wrote:
>On 2019-05-09 5:09 a.m., Olzhas Zhumabek wrote:
>
>>First, let me list the problems that this will hopefully solve, in
>>decreasing order of importance:
>>
>>1. Simplify performance issue submission
>>2. Make it easy to get rough approximation of original environment (those
>>that can be made using code only)
>>3. Quickly accept or reject issue (e.g. if it is caused by GIL itself or
>>some environment issue)
>>4. Check if performance degraded significantly
>>
>>Now let me list the cons of the idea that I have thought about:
>>
>>1. There is not much to benchmark yet
>>2. Arrival frequency of performance related issues is very low
>
>I agree to most of the above. I'm not sure about the last point. It's
>true that over recent months we have been concerned with coding issues
>and compile-time performance, but the fact that we didn't have a
>benchmarking infrastructure to measure the impact any of this had on
>(runtime) performance doesn't mean there wasn't any.

Over the last months we have been focused on improvements at the source code level:
ensuring the code is not ill-formed and compiles without warnings, updating to C++11,
covering builds with lots of compilers and compilation modes, and revealing and fixing bugs.
Concurrently, we've been overhauling the tests, restructuring them for greater
maintainability and clarity about what is being covered (and what is not), and
building them with lots of compilers, catching and fixing bugs along the way.
The old tests are in test/legacy/ and the rest of the test/ directory content
is new or based on the test/legacy/ tests.
We are still far from good test coverage, just ~50%, but we are on the right track.

>I'm fully expecting us to run the benchmark suite (once there is one)
>on par with the test suite for any (significant) code change that may
>affect performance, and I in particular expect this to be of great use
>over the next couple of months as we'll focus on image processing
>algorithms.
>
>So, this is excellent timing !

Indeed, it is a good time to start benchmarking :-)

>>I've wondered around boost libraries and it seems like ublas has a
>>benchmarks folder,

I'd prefer to call it `benchmark/`.
There are `test/` folders, not `tests/`.

>>but uses homegrown benchmarking facility, which might
>>slightly complicate reproduction.
>
>Not really, given that I wrote it. :-)
>
>But that point aside, I'm normally quite averse to the NotInventedHere
>syndrome, i.e. I'd rather avoid reinventing wheels, if possible.

Agreed.

>But before diving into a tools discussion, let's quickly collect
>requirements that any tool(s) we agree on needs to meet. Here is my
>list of use-cases.

Good idea. Let's discuss the options, compare them and perhaps summarise the outcome on the Wiki.

>Feel free to augment and complement:
>
>* We should define one benchmark per algorithm, parametrized around
>various axes, such as value- and layout types (pixels, channels, etc.)
>to make it easy to compare different instances of the same algorithm.

Yes, parametrisation of benchmark experiments is something we need.

Look at the test cases defined with BOOST_AUTO_TEST_CASE_TEMPLATE:
https://github.com/boostorg/gil/blob/develop/test/channel/algorithm_channel_arithmetic.cpp#L80-L84
where the same test case runs for each type from a fixed type-list of, for example,
channel types:
https://github.com/boostorg/gil/blob/develop/test/channel/test_fixture.hpp#L22-L32
Ideally, we would be able to parametrise benchmarks that way as well.
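
For illustration, here is a minimal sketch, assuming google/benchmark were the
tool picked; the bm_channel_fill name and its body are just made-up placeholders,
but BENCHMARK_TEMPLATE would give us one instantiation per type, much like the
type-list above:

#include <benchmark/benchmark.h>
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical benchmark body, instantiated once per channel type.
template <typename ChannelT>
static void bm_channel_fill(benchmark::State& state)
{
    std::vector<ChannelT> channels(1024);
    for (auto _ : state)
    {
        std::fill(channels.begin(), channels.end(), ChannelT{1});
        benchmark::DoNotOptimize(channels.data());
    }
}

// One instantiation per type from the fixed "type-list".
BENCHMARK_TEMPLATE(bm_channel_fill, std::uint8_t);
BENCHMARK_TEMPLATE(bm_channel_fill, std::uint16_t);
BENCHMARK_TEMPLATE(bm_channel_fill, std::uint32_t);
BENCHMARK_TEMPLATE(bm_channel_fill, float);

BENCHMARK_MAIN();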

The second category of parameters is the number of runs/repetitions/samples
and iterations/operations.

Two of the frameworks that I have used, google/benchmark and
https://github.com/DigitalInBlue/Celero, determine those dynamically.
Celero also allows specifying fixed numbers as inline macro parameters,
and google/benchmark allows specifying them via command-line arguments.
Another one, https://github.com/nickbruun/hayai, only offers fixed macro parameters.

My own 'framework' approach (which I don't recommend) can only do fixed numbers too:
https://github.com/mloskot/spatial_index_benchmark
https://github.com/mloskot/json_benchmark
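
To make the dynamic-versus-fixed difference concrete, a sketch (google/benchmark
assumed again, and bm_algo1 is only a placeholder):

#include <benchmark/benchmark.h>

static void bm_algo1(benchmark::State& state)
{
    int n = 0;
    for (auto _ : state)
    {
        benchmark::DoNotOptimize(n += 1); // stand-in for the real work
    }
}

// Dynamic: the library keeps iterating until the timing is stable, and
// --benchmark_repetitions=N or --benchmark_min_time=... can tweak that
// from the command line at run time.
BENCHMARK(bm_algo1);

// Fixed: pin the counts in code, similar to Celero's inline macro parameters.
BENCHMARK(bm_algo1)->Iterations(1000)->Repetitions(5);

BENCHMARK_MAIN();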

>* Likewise, we should define benchmarks to be able to run over a range
>of (image-) sizes, as performance will vary greatly on that (and
>depend on the hardware we run on, including but not limited to cache
>sizes).

This I recognise as the third category of parameters: those that control the problem
size of an experiment.

(I hope I have not messed up the nomenclature or made it confusing :-))
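
With google/benchmark, for example, the problem size could be passed as an argument
range; the bm_image_fill body below is, again, just a hypothetical stand-in:

#include <benchmark/benchmark.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

static void bm_image_fill(benchmark::State& state)
{
    auto const side = static_cast<std::size_t>(state.range(0));
    std::vector<std::uint8_t> image(side * side, 0); // square 8-bit image
    for (auto _ : state)
    {
        std::fill(image.begin(), image.end(), std::uint8_t{255});
        benchmark::DoNotOptimize(image.data());
    }
    state.SetBytesProcessed(
        state.iterations() * static_cast<std::int64_t>(image.size()));
}

// Power-of-two sides from 64 to 4096 should make cache-size effects visible.
BENCHMARK(bm_image_fill)->RangeMultiplier(2)->Range(64, 4096);

BENCHMARK_MAIN();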

>* It should be possible to run a single benchmark instance, and
>produce a benchmark result file, containing a table (list of
>(size,time) pairs).
>
>* It should then be possible to take multiple such files as input, and
>produce a comparative chart.

Celero has a nice notion of a `baseline` against which the other algorithms
are compared. Here you can see a sample with `strncmp` as the baseline:
https://github.com/mloskot/string_benchmark/blob/master/benchmark_ends_with.cpp
a table with results here:
https://github.com/mloskot/string_benchmark/blob/master/results/gcc63_ends_with.csv
and pretty graphs here:
http://mloskot.github.io/string_benchmark/results/
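
The shape of that is roughly as follows; this is a sketch only, loosely modelled
on the sample linked above, with made-up strings, where the samples/iterations are
the fixed inline macro parameters mentioned earlier and results are reported
relative to the baseline:

#include <celero/Celero.h>
#include <cstring>
#include <string>

CELERO_MAIN

namespace
{
    std::string const text = "boost_gil_benchmark.cpp";
    std::string const suffix = ".cpp";
}

// Baseline: plain strncmp on the tail of the string.
BASELINE(EndsWith, strncmp_tail, 30, 100000)
{
    bool const r = text.size() >= suffix.size()
        && 0 == std::strncmp(text.c_str() + text.size() - suffix.size(),
                             suffix.c_str(), suffix.size());
    celero::DoNotOptimizeAway(r);
}

// Candidate: std::string::compare, measured against the baseline above.
BENCHMARK(EndsWith, string_compare, 30, 100000)
{
    bool const r = text.size() >= suffix.size()
        && 0 == text.compare(text.size() - suffix.size(), suffix.size(), suffix);
    celero::DoNotOptimizeAway(r);
}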

Now, I'm wondering about this:
if we maintain a benchmark structure similar to the tests, e.g.
- test/algo1.cpp, benchmark/algo1.cpp
- test/algo2.cpp, benchmark/algo2.cpp
could we define a build config that allows building
1) an exe per .cpp - useful to bisect and find regressions
2) an exe from multiple algorithms, related to the same problem of course,
   with one used as the baseline - comparative benchmarking

>We have already started discussing some of this with Samuel Debionne
>in https://github.com/boostorg/gil/issues/234. Hopefully he is reading
>this mail and can jump in to participate.

I will point Samuel to this thread.

>>I propose the following changes:
>>
>>1. Create benchmarks folder in root of GIL.

I vote for `benchmark/`.

>>2. (Optional) write some simple benchmark to check if google-benchmark is
>>installed properly
>
>Yes. I have quickly looked at google-benchmark, and while I'm not yet
>convinced this meets my expectations, I'm certainly willing to try it
>out and experiment.

Although I'm not convinced by any of the choices yet either, I don't see
any stronger candidates than google/benchmark and Celero for now.

There is nothing in Boost itself to consider as an option.

>>3. Write build scripts (jamfile, cmake+conan) to provide an option to build
>>benchmarks and optionally install google-benchmark using conan
>
>Right, we need to hook the benchmarking up to whatever build system
>people use, to get visibility, and ultimately feedback.

This is the easiest part, but important indeed.

>>4. Import all existing performance issues into that folder
>
>I'm not quite sure what you mean by that.

Me neither.

>>5. Mention in contributing.md that performance issues should preferably be
>>reproduced in that folder as google benchmark and the results embedded into
>>issue.
>
>I expect that once we have a benchmarking suite, we can start
>collecting issues that focus on performance (with a well-defined
>process to reproduce problems locally).

Yup, it will be documented and CONTRIBUTING.md is the right place for that.

>In fact, at that point we
>could add a new issue category called "performance".

https://github.com/boostorg/gil/labels/cat%2Fperformance

>>What do you think?
>>Is this idea even worth it?
>>Or it could be put a bit further into to-do list?
>>If worth it, what changes exactly should I introduce?
>
>I very much agree that this is worth pursuing. And while I'm not yet
>convinced of Google-benchmarks being the right tool for the job, I'm
>open to giving it a try. We can iterate over this for a while, as long
>as we focus on the usability, so we have enough infrastructure ready
>when we start seriously working on new IP algorithms in a few weeks.

Agreed!

Olzhas, thank you for your help in this (credit also goes to Samuel who
kickstarted the discussion about benchmarks for GIL).

Best regards,

-- 
Mateusz Loskot, http://mateusz.loskot.net
Fingerprint=C081 EA1B 4AFB 7C19 38BA  9C88 928D 7C2A BB2A C1F2
