> I perfectly agree, but I think that in most cases going down to that low a level is not needed, even in image processing. I would rather take a few minutes to SIMDify the code, if possible, than rewrite it in inline assembly.

When Intel came out with its first SIMD instruction set, I was happy to try it on my application. It was a failure: even though one instruction operated on 3 data locations, its cost was 3 CPU cycles, while three integer instructions cost 1 cycle each. On top of that, with SIMD you had to load the registers first, so in my case I actually lost performance. I realized that there is a gain only in some specific cases, and mine was not one of them, even though it is an image processing algorithm. Adding more registers to the CPU would benefit both compilers and handwritten code, but adding registers was never a priority for Intel.
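To make the load/store overhead concrete, here is a minimal sketch in C. It assumes SSE2 intrinsics on x86 rather than the original instruction set from my experiment, and the function names `brighten_scalar` and `brighten_simd` are made up for illustration. Where SSE2 is not available, the SIMD version simply falls back to the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics (assumed target: x86 with SSE2) */
#endif

/* Scalar version: one saturating add per pixel byte. */
static void brighten_scalar(uint8_t *px, size_t n, uint8_t delta)
{
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)px[i] + delta;
        px[i] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}

/* SIMD version: 16 pixels per add instruction, but every iteration
 * also pays for an explicit load into a register and a store back. */
static void brighten_simd(uint8_t *px, size_t n, uint8_t delta)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i d = _mm_set1_epi8((char)delta);  /* delta in all 16 lanes */
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(px + i)); /* load  */
        v = _mm_adds_epu8(v, d);                    /* saturating add     */
        _mm_storeu_si128((__m128i *)(px + i), v);   /* store              */
    }
#endif
    /* Leftover pixels (or everything, without SSE2) go through the scalar loop. */
    brighten_scalar(px + i, n - i, delta);
}
```

Whether the SIMD loop wins depends on how much work is done between the load and the store. With a single cheap operation per load, as here, the register traffic can eat most of the gain, which is roughly what happened to me.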

 

I hope that today the same SIMD instruction executes in only 1 cycle. I should check this one day or another…

 

And yes, needing to optimize at this level is a rare situation.

 

P.S.: sorry Asif