From: Olzhas Zhumabek (anonymous.from.applecity_at_[hidden])
Date: 2021-07-06 04:57:11
I thought about the effects of std::inner_product, and reached the
conclusion that it probably does not contribute anything. The main problem is that at
some iterations the start point is probably not aligned for SIMD
instructions to be executed. Could you please try to write the kernel
application loops by hand and see if there is a difference? I believe
there is significant time spent in rotation of the cache. The loops will be:
1. Row loop:
1.1 Copy into cache loop(s)
1.2 Kernel applications loops
Notice that 1.2 is not inside 1.1.
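The loop structure above could be sketched roughly as follows. This is only an illustration under assumptions, not the actual GIL code: it assumes a single-channel float image stored row-major in a std::vector, computes only the "valid" output region, and re-copies the rows for every output row instead of rotating the cache. The function name correlate_rows_by_hand is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Boost.GIL: "valid" 2D correlation of a
// row-major float image with hand-written loops and an explicit row cache.
std::vector<float> correlate_rows_by_hand(const std::vector<float>& image,
                                          std::size_t width, std::size_t height,
                                          const std::vector<float>& kernel,
                                          std::size_t kernel_width,
                                          std::size_t kernel_height)
{
    if (kernel_width == 0 || kernel_height == 0 ||
        width < kernel_width || height < kernel_height)
        return {};

    const std::size_t out_width = width - kernel_width + 1;
    const std::size_t out_height = height - kernel_height + 1;
    std::vector<float> output(out_width * out_height, 0.0f);

    // Cache holds kernel_height consecutive image rows, stored contiguously.
    std::vector<float> cache(kernel_height * width);

    for (std::size_t y = 0; y < out_height; ++y)             // 1. Row loop
    {
        // 1.1 Copy into cache loops (assumption: plain copy, no rotation)
        for (std::size_t ky = 0; ky < kernel_height; ++ky)
        {
            const float* src = image.data() + (y + ky) * width;
            float* dst = cache.data() + ky * width;
            for (std::size_t x = 0; x < width; ++x)
                dst[x] = src[x];
        }

        // 1.2 Kernel application loops (separate from 1.1, not nested in it)
        for (std::size_t x = 0; x < out_width; ++x)
        {
            float sum = 0.0f;
            for (std::size_t ky = 0; ky < kernel_height; ++ky)
            {
                const float* row = cache.data() + ky * width + x;
                const float* krow = kernel.data() + ky * kernel_width;
                for (std::size_t kx = 0; kx < kernel_width; ++kx)
                    sum += row[kx] * krow[kx];
            }
            output[y * out_width + x] = sum;
        }
    }
    return output;
}
```

Timing this against the std::inner_product version on the same input should show whether the hand-written inner loop makes any difference.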
The other thing to investigate is to have some threshold at which
the brute-force algorithm will be used. It is probably faster as it doesn't
involve unnecessary copies and both the image and the kernel
fit into the cache anyway.
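A minimal sketch of such a threshold check, together with a brute-force path that reads the image directly with no intermediate copies. Everything here is an assumption for illustration: the names fits_brute_force and correlate_brute_force are hypothetical, the image is a single-channel row-major float buffer, and the 32 KiB value is only a placeholder that should be replaced by whatever the benchmarks show.

```cpp
#include <cstddef>
#include <vector>

// Placeholder threshold (assumption); the real value must come from benchmarks.
constexpr std::size_t brute_force_threshold_bytes = 32 * 1024;

// Hypothetical predicate: true when image and kernel together fit in the budget.
bool fits_brute_force(std::size_t width, std::size_t height,
                      std::size_t kernel_width, std::size_t kernel_height,
                      std::size_t threshold = brute_force_threshold_bytes)
{
    const std::size_t bytes =
        (width * height + kernel_width * kernel_height) * sizeof(float);
    return bytes <= threshold;
}

// Hypothetical brute-force "valid" 2D correlation: no row cache, no copies.
std::vector<float> correlate_brute_force(const std::vector<float>& image,
                                         std::size_t width, std::size_t height,
                                         const std::vector<float>& kernel,
                                         std::size_t kernel_width,
                                         std::size_t kernel_height)
{
    if (kernel_width == 0 || kernel_height == 0 ||
        width < kernel_width || height < kernel_height)
        return {};

    const std::size_t out_width = width - kernel_width + 1;
    const std::size_t out_height = height - kernel_height + 1;
    std::vector<float> output(out_width * out_height, 0.0f);

    for (std::size_t y = 0; y < out_height; ++y)
        for (std::size_t x = 0; x < out_width; ++x)
        {
            float sum = 0.0f;
            for (std::size_t ky = 0; ky < kernel_height; ++ky)
                for (std::size_t kx = 0; kx < kernel_width; ++kx)
                    sum += image[(y + ky) * width + (x + kx)] *
                           kernel[ky * kernel_width + kx];
            output[y * out_width + x] = sum;
        }
    return output;
}
```

The caller would check fits_brute_force first and fall back to the cached version otherwise.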
The last option to consider is not to load the whole row by kernel height.
Try to load as much as there is cache size minus, say, 512 bytes for
Please use Celero's value generation to find those thresholds, as
trying out by hand will waste time for compilation. Just write a new
and generate some values.
Ideally we need to collect some metrics on performance, like cache misses
and stuff like that. If you have an Intel CPU, please try out VTune. We can
check it out together in our next meeting.