OK, after trying out A LOT of different approaches to do this, using all kinds of ublas matrix types, etc., I got the same super poor performance every time. From all my experiments it's now clear that cache is not an issue. After trying so much, I have no hope left that I'll get it to work the way I want. My only choice now is to give up on ublas and try something else, perhaps implementing my own basic Matrix classes.

Thanks for the help, guys.
