On Tue, Apr 28, 2009 at 5:05 PM, arm2arm <arm2arm@gmail.com> wrote:

To: Ovanes
Is there a gain in speed if I use for_each?
I always avoid using for_each so that the compiler (like Intel's) can
auto-parallelize the regions.
But for this particular case it is not an issue.

To be honest, that's strange. I know from MSVC that using the std:: algorithms allows parallelisation.

E.g. using std::find to search for a 32-bit numeric value in a vector runs up to 4 times faster, due to XMM register optimization. A copied-out version of the std::find loop implementation runs slower, since the compiler does not know how the vector pointed to by the pointer/iterator is aligned. It is pretty well explained here:
http://www.agner.org/optimize/optimizing_cpp.pdf Chapter 11. My experiments with MSVC 2003 & 2005 show that searching for a number (which is not present in the vector) in a loop copied from the find implementation is 4 times slower than using find itself. I am curious how it is with the Intel compiler.
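
For anyone who wants to repeat that comparison on their own compiler, here is a minimal sketch of the experiment. It is not the code I originally ran: the names manual_find, time_us, FindTimings and compare_find are made up for illustration, it assumes a C++11 compiler (lambdas, <chrono>), and a single timed call like this is only a rough indication. Results will depend on the compiler, flags and hardware.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hand-written linear search, i.e. the "loop copied out of find".
// (Hypothetical helper name, not taken from any library.)
const int* manual_find(const int* first, const int* last, int value) {
    for (; first != last; ++first) {
        if (*first == value) break;
    }
    return first;
}

// Runs `search` once and returns the elapsed time in microseconds.
template <typename F>
long long time_us(F search) {
    auto start = std::chrono::steady_clock::now();
    search();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

struct FindTimings {
    long long std_find_us;  // time spent in std::find
    long long manual_us;    // time spent in manual_find
};

// Searches a zero-filled vector of n ints for `missing`, which must be
// non-zero so that both searches scan the whole range (the worst case).
FindTimings compare_find(std::size_t n, int missing) {
    std::vector<int> data(n, 0);
    const int* first = data.data();
    const int* last = first + data.size();
    const int* a = nullptr;
    const int* b = nullptr;

    FindTimings t;
    t.std_find_us = time_us([&] { a = std::find(first, last, missing); });
    t.manual_us = time_us([&] { b = manual_find(first, last, missing); });

    // Both searches must report "not found"; this also keeps the results used.
    assert(a == last && b == last);
    return t;
}
```

Calling e.g. compare_find(100000000, 42) in an optimized build and printing both fields gives the two timings to compare. Repeating the measurement several times and discarding the first run would make the numbers more trustworthy.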

IMO the STL algorithms delivered with a compiler are pretty well optimized for that particular compiler version. If you write your own loop, it is for sure not faster than the comparable STL algorithm distributed with the compiler. Does Intel explicitly state that they do not optimize STL code, that it is not parallelized, and that the STL does not use XMM registers?

Thanks,
Ovanes