Subject: Re: Streamlining benchmarking process
From: Mateusz Loskot (mateusz_at_[hidden])
Date: 2019-05-09 23:54:01


On 19-05-09 18:39:58, Stefan Seefeld wrote:
>On 2019-05-09 6:11 p.m., Mateusz Loskot wrote:
>>On 19-05-09 17:32:53, stefan wrote:
>>>On 2019-05-09 5:11 p.m., Mateusz Loskot wrote:
>>>>Now, I'm wondering this:
>>>>if we maintain benchmark structure similar to tests e.g.
>>>>- test/algo1.cpp, benchmark/algo1.cpp
>>>>- test/algo2.cpp, benchmark/algo2.cpp
>>>>could we define a build config that allows us to build
>>>>1) exe per .cpp - useful to bisect, find regressions
>>>>2) exe from multiple algorithms, related to the same problem of course,
>>>>  with one used as a baseline - comparative benchmarking
>>>
>>>I'm in favor of 1). Whether that's truly one algorithm or not,
>>>however, depends on whether the same code is parametrized to yield
>>>multiple implementations.
>>
>>My (loose) idea was more like this:
>>algo1.cpp is strstr()
>>algo2.cpp is std::string::find()
>>Each algo[1-2].cpp solves the same problem, but with a different algorithm.
>>
>>1) build to run just the benchmark of strstr() (parameterised for
>>problem size)
>>2) build to run both together, with e.g. strstr() as baseline - yes,
>>I'm a bit biased by the Celero framework here.
>
>My use-case / requirement was that it must be possible to run
>algorithms independently. Of course, in the end, you may want to
>"aggregate" results from multiple runs, to do some comparison or other
>analysis. But that step is fully orthogonal to the run itself.

My point was a bit deeper than just results aggregation.

(I'll keep using the strstr vs string::find analogy.)

Single-algorithm runs give absolute values that 1) need historic results to
compare against and 2) need fixed run/iteration parameters; they are useful
for finding regressions:

1. May 1, 1000 iterations of strstr() takes 10s
2. Jun 1, 1000 iterations of strstr() takes 7s
3. Jul 1, 1000 iterations of strstr() takes 11s

or for bisecting,
or for monitoring optimisation effects (e.g. compilation flags, algorithm
improvements) while working on improvements (only the time scale is different:
now, 1 hour later, etc.).
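
To make the first case concrete, here is a minimal sketch of such a
fixed-parameter run - plain std::chrono timing, not tied to any framework;
the haystack size, needle and iteration count are made-up placeholders:

   #include <chrono>
   #include <cstdio>
   #include <cstring>
   #include <string>

   int main()
   {
       std::string const haystack(1000000, 'a'); // placeholder problem size
       char const* const needle = "needle";      // placeholder pattern
       int const iterations = 1000;              // fixed, so results stay comparable over time

       char const* volatile sink = nullptr;      // volatile sink keeps the call from being optimised away

       auto const start = std::chrono::steady_clock::now();
       for (int i = 0; i < iterations; ++i)
           sink = std::strstr(haystack.c_str(), needle);
       auto const stop = std::chrono::steady_clock::now();
       (void)sink;

       std::chrono::duration<double> const elapsed = stop - start;
       // absolute result; meaningful only when compared against historic runs
       std::printf("strstr,%d,%f\n", iterations, elapsed.count());
   }

The printed line is deliberately a CSV row, which ties in with the
aggregation point further below.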

Benchmarking multiple algorithms with adaptive parametrisation of runs and
iterations (the default in google/benchmark and Celero) is useful for relative
comparison of different algorithms against some baseline.
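
And a sketch of the comparative case in the Celero style, with strstr() as
the baseline; the group/benchmark names and the sample/iteration counts are
placeholders (Celero reports each benchmark relative to its baseline):

   #include <celero/Celero.h>

   #include <cstring>
   #include <string>

   CELERO_MAIN

   namespace
   {
       std::string const haystack(1000000, 'a'); // placeholder problem size
       char const* const needle = "needle";      // placeholder pattern
   }

   // strstr() is the baseline; 30 samples x 10000 iterations are placeholder counts
   BASELINE(StringSearch, c_strstr, 30, 10000)
   {
       celero::DoNotOptimizeAway(std::strstr(haystack.c_str(), needle));
   }

   // std::string::find() is measured and reported relative to the baseline
   BENCHMARK(StringSearch, string_find, 30, 10000)
   {
       celero::DoNotOptimizeAway(haystack.find(needle));
   }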

To me, the two approaches serve different goals and answer different questions:
- Has performance changed since...?
- Which one is faster?
Both are equally useful.

>>We want to compare apples to apples, obviously.
>>
>>Again, it was a loose brainstorm. How we aggregate benchmark
>>implementations doesn't have to be important for now.
>
>True, as long as the requirements are met. (I'm mentioning it because
>before my re-implementation, Boost.uBLAS benchmarks would perform
>multiple algorithm executions from the same executable (process),
>which prevented me from recording results separately. This was one
>reason for me not to keep the code as it was.)

I understand.

I prefer to piggy-back on common formats, e.g. CSV, so that I can aggregate
either multiple single-result files or multiple multi-result files.

>>>you can see how a single compilation unit / executable can
>>>represent the same algorithm for multiple (template) parameters -
>>>value-types, or in our case, pixel types or other compile-time
>>>parameters.
>>
>>I see. However, it's a bit different from what I had in mind.
>>I'm more fond of a benchmark-case 'generator' similar to this:
>>
>>   BOOST_AUTO_TEST_CASE_TEMPLATE(test_case_name,
>>       formal_type_parameter_name, collection_of_types);
>
>
>That's an implementation detail, which just so happens to use some
>"data driven test" idioms to express the parametrization. I'm not
>saying that's necessarily a bad choice (in fact, it's quite similar to
>my code above, I believe), as long as it lets you (I repeat myself, I
>know) control via the CLI which parameter type you would like to
>execute the benchmark with.

I take it as a valid point.
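
For illustration, a minimal sketch of that kind of 'generator' using
Boost.Test's template test case facility; the module name, type list and
case body below are placeholders only:

   #define BOOST_TEST_MODULE benchmark_generator // placeholder module name
   #include <boost/test/included/unit_test.hpp>

   #include <boost/mpl/list.hpp>

   #include <cstdint>
   #include <vector>

   // illustrative list of element types; one case is generated per type
   using element_types = boost::mpl::list<std::uint8_t, std::uint16_t, float>;

   BOOST_AUTO_TEST_CASE_TEMPLATE(search_benchmark, T, element_types)
   {
       std::vector<T> data(1000); // placeholder problem size
       // ... run and time the measured operation on data here ...
       BOOST_TEST(!data.empty());
   }

Each generated case gets its own name, so a single type can still be picked
from the command line with Boost.Test's --run_test filter, which seems
compatible with the CLI requirement above.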

>(Running the whole set of types *might*
>work, but seems to impose requirements on how to store benchmark
>results. That is, you lose the ability to store them independently,
>and later aggregate them freely. The aggregation is already baked into
>the code structure.)

If the output is in CSV, then in your case you get a file with one row and in
my case a file with multiple rows. I usually have a script to aggregate or
chart all rows or selected row(s).

However, to me, the key points/differences are the ones I explained at the
beginning of this response. The issue of aggregation is secondary. I see both
approaches as complementary.

Best regards,

-- 
Mateusz Loskot, http://mateusz.loskot.net
Fingerprint=C081 EA1B 4AFB 7C19 38BA  9C88 928D 7C2A BB2A C1F2
