> When Intel came out with its first SIMD instruction set I was happy to
> try them on my application. It was a failure. Because even if one
> instruction executes on 3 data locations its cost was 3 CPU cycles. Three
> integer instructions cost 1 cycle each. Also with SIMD you had to load the
> registers first. <snip>

MMX and SSE were rather a disaster on this point.
Actually, I learned to play with SIMD using AltiVec on PowerPC, and that's a completely different deal. SSSE3 and the upcoming SSE4 are rather good too.

> And, Yes, optimizing at this level is a rare situation.
I would be curious to know which kinds of IP algorithms need this, but maybe this is a topic that should go private instead of adding noise to the list.