>> If you are lucky the compiler can produce code that does this in one
>> instruction set, if not you will get an overhead. Again if h and w are low
>> values you will not notice it.
>And that's exactly what happens when you do loop tiling, the inner h and w are rather small (tile often occupies less than half the L1 cache in size)

>> This example is of course much faster, and yes, it is not elegant nor clear.
>> int *p=array,*pend=&array[h][w];
>> while (p < pend) *p++ = uni() ;
>Well, I copy/pasted this in my array.cpp in place of the loop nest.
>My 2D array take 9.938s to iterate 10000 over a 512*512 image of float, while your "while loop" took 9.891.
>I only lose ~0.5% by using NRC allocation + indexing. It's indeed faster but not by that much and, indeed, far less elegant.

When you use an index, what you lose is the computation cost of calculating the address plus the maintenance of the two indexes. Only that. I don't think it has much to do with the L1 cache, especially if the few loop variables live in registers.
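To make that concrete, here is a minimal sketch (function and variable names are my own) of the two loop shapes over a contiguous h*w float array: the index form maintains two counters and implies per-element address math, while the pointer form just bumps one pointer.

```cpp
#include <cstddef>

// Illustrative sizes matching the 512*512 image in the thread.
constexpr std::size_t H = 512, W = 512;

// Index form: each a[i][j] means *(&a[0][0] + i*W + j), so the loop
// maintains two counters and redoes the index arithmetic per element
// (unless the optimizer strength-reduces it into a pointer walk).
void fill_indexed(float (&a)[H][W], float v) {
    for (std::size_t i = 0; i < H; ++i)
        for (std::size_t j = 0; j < W; ++j)
            a[i][j] = v;
}

// Pointer form: one pointer increment and one compare per element,
// the same shape as the "while (p < pend) *p++ = ..." loop above.
void fill_pointer(float (&a)[H][W], float v) {
    float *p = &a[0][0];
    float *pend = p + H * W;
    while (p < pend)
        *p++ = v;
}
```

Both functions touch exactly the same memory in the same order; only the bookkeeping per element differs.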


Now, how much is that cost? A few CPU cycles per address, maybe 4, not much more with a good compiler/CPU. If that uni() function above needs something like 50 CPU cycles, then the gain from switching to pointers is very small, as in your example. But if instead of uni() you have a simple add that costs only 1 cycle, then you are looking at roughly a 5x performance loss. Even that may not matter if the total number of items is relatively low, and that is the common case.


Try the pointer-versus-index comparison again with the simple "res += array[ti][tj] ;" statement as the loop body.
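A sketch of that experiment (sizes, repetition count, and names are my own choices): with a one-add body there is no expensive call to hide the address arithmetic behind, so timing the two variants side by side should expose the per-element overhead.

```cpp
#include <chrono>

constexpr int H = 512, W = 512;

// Cheap one-add body through indexing: nothing masks the index math.
float sum_indexed(const float (&a)[H][W], int reps) {
    float res = 0.0f;
    for (int rep = 0; rep < reps; ++rep)
        for (int ti = 0; ti < H; ++ti)
            for (int tj = 0; tj < W; ++tj)
                res += a[ti][tj];
    return res;
}

// The same reduction through a single walking pointer.
float sum_pointer(const float (&a)[H][W], int reps) {
    float res = 0.0f;
    for (int rep = 0; rep < reps; ++rep) {
        const float *p = &a[0][0];
        const float *pend = p + H * W;
        while (p < pend)
            res += *p++;
    }
    return res;
}

// Wall-clock helper so the two variants can be compared directly.
template <class F>
long long microseconds(F f, float &out) {
    auto t0 = std::chrono::steady_clock::now();
    out = f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0)
        .count();
}
```

Timing sum_indexed against sum_pointer on the same image shows the gap most clearly at -O0; at -O2 a modern optimizer usually strength-reduces the index form into the pointer form, which is why the measured difference in the quoted test was so small.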


In a program where I use many images and arrays, the real gain came from switching a few functions to handwritten assembler code: 6 times faster than an already good C++ algorithm. All the rest of the application is standard, elegant, and "slow" code, but it doesn't impact the overall performance.