Here is a sample code that performs essentially the same operations as the main code, with the same kind of access patterns. It is much faster than my code (more than 10x), probably because everything it touches, such as the element stiffness matrices and the indices, sits well aligned in contiguous blocks of memory. That again suggests my main code has cache issues. My only hope now is to try a new, more cache-friendly data structure design in which all the data used in that section of the code is stored in contiguous blocks of memory.
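A minimal sketch of what such a contiguous layout might look like, in C++. The names `ElementBlock` and `apply_stiffness` are hypothetical and not taken from the main code, and the sketch assumes every element has the same number of degrees of freedom, so that element `e` always starts at a fixed offset in each flat array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout: all element data packed into two flat arrays,
// so looping over elements walks through memory sequentially.
struct ElementBlock {
    std::size_t dofs_per_elem;        // assumed identical for all elements
    std::vector<double> ke;           // num_elems * n * n entries, row-major per element
    std::vector<std::size_t> conn;    // num_elems * n global dof indices

    std::size_t num_elems() const { return conn.size() / dofs_per_elem; }
};

// Computes y += K * x element by element: gather, small dense product, scatter.
void apply_stiffness(const ElementBlock& blk,
                     const std::vector<double>& x,
                     std::vector<double>& y) {
    const std::size_t n = blk.dofs_per_elem;
    assert(blk.ke.size() == blk.num_elems() * n * n);
    assert(x.size() == y.size());

    std::vector<double> xe(n);  // local copy of the element's dof values
    for (std::size_t e = 0; e < blk.num_elems(); ++e) {
        const double* k = blk.ke.data() + e * n * n;         // this element's matrix
        const std::size_t* idx = blk.conn.data() + e * n;    // this element's indices

        // Gather: the only indirect (potentially cache-unfriendly) reads.
        for (std::size_t i = 0; i < n; ++i) xe[i] = x[idx[i]];

        // Dense product on contiguous data, then scatter into y.
        for (std::size_t i = 0; i < n; ++i) {
            double sum = 0.0;
            for (std::size_t j = 0; j < n; ++j) sum += k[i * n + j] * xe[j];
            y[idx[i]] += sum;
        }
    }
}
```

With this layout the stiffness entries and indices are read in order, and only the gather from `x` and the scatter into `y` remain indirect. Whether that accounts for the full speed difference would still need to be measured on the real code.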
Any help is welcome.
Thanks,
x