> I perfectly agree, but I think that in most cases going down to that low a
> level is not needed, even in image processing. I would rather take a few
> minutes to SIMDify code, if possible, than rewrite it in inline assembly.

When Intel came out with its first SIMD instruction set, I was happy to try it
in my application. It was a failure: even though one instruction operated on 3
data locations, it cost 3 CPU cycles, while three integer instructions cost
1 cycle each. With SIMD you also had to load the registers first, so in my case
I lost performance. I realized that there is a gain only in some specific
cases, and mine was not one of them, even though it is an image processing
algorithm. Adding more registers to the CPU would give more benefit to
compilers and handwritten code, but adding registers was never a priority for
Intel.
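
To put rough numbers on this, here is a minimal sketch of the cycle arithmetic, using only the figures quoted above (3 data locations per SIMD instruction, 3 cycles per SIMD instruction, 1 cycle per integer instruction). The function names and the `load_cost` parameter are hypothetical and only for illustration. The load cost in particular is an assumed value, not a measurement, and the model ignores pipelining, memory latency, and everything else a real CPU does.

```c
/* Toy cycle-count model built from the figures in the post.
   All names are hypothetical; load_cost is an assumed, unmeasured value. */

/* Scalar path: one integer instruction per element, 1 cycle each. */
static unsigned scalar_cycles(unsigned n_elements)
{
    return n_elements * 1u;
}

/* SIMD path: one instruction per group of `lanes` elements (lanes > 0).
   Each group costs `simd_cost` cycles for the instruction itself plus
   `load_cost` cycles to fill the SIMD registers beforehand. */
static unsigned simd_cycles(unsigned n_elements, unsigned lanes,
                            unsigned simd_cost, unsigned load_cost)
{
    unsigned groups = (n_elements + lanes - 1u) / lanes; /* round up */
    return groups * (simd_cost + load_cost);
}
```

Under this model, `scalar_cycles(3)` and `simd_cycles(3, 3, 3, 0)` both come out to 3 cycles, so SIMD only breaks even before any register loading is counted. Any nonzero `load_cost` makes it slower, which matches what I saw. If the SIMD instruction cost 1 cycle instead, `simd_cycles(300, 3, 1, 1)` would give 200 cycles against 300 for the scalar path.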
I hope that today the same SIMD instruction executes in only 1 cycle; I should
check this one day or another…
And yes, optimizing at this level is rarely needed.
P.S.: sorry Asif