>> If you are lucky the compiler can produce code that does this in one
>> instruction, if not you will get an overhead. Again, if h and w are low
>> values you will not notice it.
> And that's exactly what happens when you do loop tiling: the inner h and w
> are rather small (a tile often occupies less than half the L1 cache in size).
>> This example is of course much faster, and yes, it is not elegant nor clear.
>>
>> int *p = array, *pend = &array[h][w];
>> while (p < pend) *p++ = uni();
> Well, I copy/pasted this in my array.cpp in place of the loop nest.
> My 2D array takes 9.938s to iterate 10000 times over a 512*512 image of
> float, while your "while loop" took 9.891s. I only lose ~0.5% by using NRC
> allocation + indexing. It's indeed faster, but not by that much and, indeed,
> far less elegant.
When you use an index, what you lose is the cost of computing the address and
of maintaining the two indexes. Only that. I don't think it has much to do
with the L1 cache, especially if those few variables are kept in registers.
Now how much is that cost? A few CPU cycles per address, maybe 4, not much
more with a good compiler/CPU. If that uni() function needs something like 50
CPU cycles, then the gain from switching to pointers is very low, as in your
example. But if instead of uni() you have a simple add that costs only 1
cycle, then you are looking at a 5x performance loss. Even that may not
matter if the total number of items is relatively low, and this is the
common case.
Try redoing the index-versus-pointer comparison with the simple
"res += array[ti][tj];" body instead of uni().
In a program where I use many images and arrays, the real gain came from
switching a few functions to handwritten assembler code: 6 times faster than
an already good C++ algorithm. All the rest of the application is standard,
elegant and "slow" code, but it doesn't impact the overall performance.