Vardan,
I get the same performance with double precision. I have NDEBUG defined, but I don't know about other optimization flags. I haven't profiled with any tool, but the bottleneck is exactly at the section of the code where I perform the element-wise summation. I'll try to put together a simple sample that performs the same kind of operations to see what I get.
Thanks.