I'm currently studying C optimizations and had a past assignment optimizing a piece of code. Among other optimizations (loop unrolling and strength reduction) I used blocking according to cache size, following Intel's tutorial on the matter; the code is below.
Now I think I understand why this technique works in the stride-one case: loading a block into cache reduces the number of misses when accessing the next place in memory. But in my code dst[dim * jj + ii] seems to jump around all over the place, since jj is multiplied by dim in the innermost loop. How does the cache account for this? dim is multiplied by 0, then 1, then 2, and so on; at some point the addresses will surpass what the cache can hold and the optimization will be pointless. Am I understanding this right?
In practice, however, when I used blocking only for the jj variable I didn't get the speedup in performance that I got from using blocking on both ii and jj. So I made it faster but don't know why. The assignment is over now, but I still don't understand, and it's quite frustrating.
Thank you in advance for bearing with what may be a very stupid question.
void transpose(int *dst, int *src, int dim)
{
    int i, j, dimi, jj, ii;   /* block_size is assumed to be defined elsewhere */

    for (i = 0; i < dim; i += block_size)
    {
        for (j = 0; j < dim; j += block_size)
        {
            for (ii = i; ii < i + block_size; ii++)
            {
                dimi = dim * ii;
                for (jj = j; jj < j + block_size; jj++)
                {
                    dst[dim * jj + ii] = src[dimi + jj];
                }
            }
        }
    }
}
gcc -O3 will auto-vectorize, but IDK if it will do a good job here. Contiguous stores may be better than contiguous loads; then you could vectorize with vector load + shuffle, and vector store. A compiler could vectorize over the ii loop (2nd-from-inner) instead of the inner one, though, and produce 4 dst[] results in parallel. (Except gcc for x86 does a vector load and unpacks it into 4 scalar stores. It could check for overlap between src and dst, but doesn't. godbolt.org/g/g24ehr IDK what clang is doing with a 256-byte copy) - Peter Cordes