776
votes

Here is the extract from the program in question. The matrix img[][] has the size SIZE×SIZE, and is initialized at:

img[j][i] = 2 * j + i

Then, you make a matrix res[][], and each field in here is made to be the average of the 9 fields around it in the img matrix. The border is left at 0 for simplicity.

for(i=1;i<SIZE-1;i++) 
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        for(k=-1;k<2;k++) 
            for(l=-1;l<2;l++) 
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;
}

That's all there's to the program. For completeness' sake, here is what comes before. No code comes after. As you can see, it's just initialization.

#define SIZE 8192
float img[SIZE][SIZE]; // input image
float res[SIZE][SIZE]; //result of mean filter
int i,j,k,l;
for(i=0;i<SIZE;i++) 
    for(j=0;j<SIZE;j++) 
        img[j][i] = (2*j+i)%8196;

Basically, this program is slow when SIZE is a multiple of 2048, e.g. the execution times:

SIZE = 8191: 3.44 secs
SIZE = 8192: 7.20 secs
SIZE = 8193: 3.18 secs

The compiler is GCC. From what I know, this is because of memory management, but I don't really know too much about that subject, which is why I'm asking here.

Also how to fix this would be nice, but if someone could explain these execution times I'd already be happy enough.

I already know of malloc/free, but the problem is not amount of memory used, it's merely execution time, so I don't know how that would help.

2
@bokan it happens when the size is a multiple of the critical stride of the cache.Luchian Grigore
@Mysticial, it doesn't matter, it exposes same exact problem; code can be different, but basically both of the question ask about the same time (and their titles are definitely similar).Griwes
You should not process image using 2 dimension array if you want high performance. Consider all pixels being in a raw and process them like a one dimension array. Do this blur in two pass. First add value of surrounding pixels using a sliding sum of 3 pixels : slideSum+=src[i+1]-src[i-1]; dest[i]=slideSum;. Then do the same vertically and divide at same time : dest[i]=(src[i-width]+src[i]+src[i+width])/9. www-personal.engin.umd.umich.edu/~jwvm/ece581/18_RankedF.pdfbokan
There's actually two things going on here. It's not just super-alignment.Mysticial
(Just a minor nitpick on your answer. For the first code segment, it would be nice if all your for loops had braces.)Trevor Boyd Smith

2 Answers

979
votes

The difference is caused by the same super-alignment issue from the following related questions:

But that's only because there's one other problem with the code.

Starting from the original loop:

for(i=1;i<SIZE-1;i++) 
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        for(k=-1;k<2;k++) 
            for(l=-1;l<2;l++) 
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;
}

First notice that the two inner loops are trivial. They can be unrolled as follows:

for(i=1;i<SIZE-1;i++) {
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        res[j][i] += img[j-1][i-1];
        res[j][i] += img[j  ][i-1];
        res[j][i] += img[j+1][i-1];
        res[j][i] += img[j-1][i  ];
        res[j][i] += img[j  ][i  ];
        res[j][i] += img[j+1][i  ];
        res[j][i] += img[j-1][i+1];
        res[j][i] += img[j  ][i+1];
        res[j][i] += img[j+1][i+1];
        res[j][i] /= 9;
    }
}

So that leaves the two outer-loops that we're interested in.

Now we can see the problem is the same in this question: Why does the order of the loops affect performance when iterating over a 2D array?

You are iterating the matrix column-wise instead of row-wise.


To solve this problem, you should interchange the two loops.

for(j=1;j<SIZE-1;j++) {
    for(i=1;i<SIZE-1;i++) {
        res[j][i]=0;
        res[j][i] += img[j-1][i-1];
        res[j][i] += img[j  ][i-1];
        res[j][i] += img[j+1][i-1];
        res[j][i] += img[j-1][i  ];
        res[j][i] += img[j  ][i  ];
        res[j][i] += img[j+1][i  ];
        res[j][i] += img[j-1][i+1];
        res[j][i] += img[j  ][i+1];
        res[j][i] += img[j+1][i+1];
        res[j][i] /= 9;
    }
}

This eliminates all the non-sequential access completely so you no longer get random slow-downs on large powers-of-two.


Core i7 920 @ 3.5 GHz

Original code:

8191: 1.499 seconds
8192: 2.122 seconds
8193: 1.582 seconds

Interchanged Outer-Loops:

8191: 0.376 seconds
8192: 0.357 seconds
8193: 0.351 seconds
57
votes

The following tests have been done with Visual C++ compiler as it is used by the default Qt Creator install (I guess with no optimization flag). When using GCC, there is no big difference between Mystical's version and my "optimized" code. So the conclusion is that compiler optimizations take care off micro optimization better than humans (me at last). I leave the rest of my answer for reference.


It's not efficient to process images this way. It's better to use single dimension arrays. Processing all pixels is the done in one loop. Random access to points could be done using:

pointer + (x + y*width)*(sizeOfOnePixel)

In this particular case, it's better to compute and cache the sum of three pixels groups horizontally because they are used three times each.

I've done some tests and I think it's worth sharing. Each result is an average of five tests.

Original code by user1615209:

8193: 4392 ms
8192: 9570 ms

Mystical's version:

8193: 2393 ms
8192: 2190 ms

Two pass using a 1D array: first pass for horizontal sums, second for vertical sum and average. Two pass addressing with three pointers and only increments like this:

imgPointer1 = &avg1[0][0];
imgPointer2 = &avg1[0][SIZE];
imgPointer3 = &avg1[0][SIZE+SIZE];

for(i=SIZE;i<totalSize-SIZE;i++){
    resPointer[i]=(*(imgPointer1++)+*(imgPointer2++)+*(imgPointer3++))/9;
}

8193: 938 ms
8192: 974 ms

Two pass using a 1D array and addressing like this:

for(i=SIZE;i<totalSize-SIZE;i++){
    resPointer[i]=(hsumPointer[i-SIZE]+hsumPointer[i]+hsumPointer[i+SIZE])/9;
}

8193: 932 ms
8192: 925 ms

One pass caching horizontal sums just one row ahead so they stay in cache:

// Horizontal sums for the first two lines
for(i=1;i<SIZE*2;i++){
    hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
}
// Rest of the computation
for(;i<totalSize;i++){
    // Compute horizontal sum for next line
    hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
    // Final result
    resPointer[i-SIZE]=(hsumPointer[i-SIZE-SIZE]+hsumPointer[i-SIZE]+hsumPointer[i])/9;
}

8193: 599 ms
8192: 652 ms

Conclusion:

  • No benefits of using several pointers and just increments (I thought it would have been faster)
  • Caching horizontal sums is better than computing them several time.
  • Two pass is not three times faster, two times only.
  • It's possible to achieve 3.6 times faster using both a single pass and caching an intermediary result

I'm sure it's possible to do much better.

NOTE Please, note that I wrote this answer to target general performance issues rather than the cache problem explained in Mystical's excellent answer. At the beginning it was just pseudo code. I was asked to do tests in the comments... Here is a completely refactored version with tests.