Fast Gaussian Blur image filter with ARM NEON

Question

I'm trying to make a mobile fast version of Gaussian Blur image filter.

I've read other questions, like: Fast Gaussian blur on unsigned char image- ARM Neon Intrinsics- iOS Dev

For my purpose i need only a fixed size (7x7) fixed sigma (2) Gaussian filter.

So, before optimizing for ARM NEON, I'm implementing 1D Gaussian Kernel in C++, and comparing performance with OpenCV GaussianBlur() method directly in mobile environment (Android with NDK). This way it will result in a much simpler code to optimize.

However the result is that my implementation is 10 times slower then OpenCV4Android version. I've read that OpenCV4 Tegra have optimized GaussianBlur implementation, but I don't think that standard OpenCV4Android have those kind of optimizations, so why is my code so slow?

Here is my implementation (note: reflect101 is used for pixel reflection when applying filter near borders):

Mat myGaussianBlur(Mat src){
    Mat dst(src.rows, src.cols, CV_8UC1);
    Mat temp(src.rows, src.cols, CV_8UC1);
    float sum, x1, y1;

    // coefficients of 1D gaussian kernel with sigma = 2
    double coeffs[] = {0.06475879783, 0.1209853623, 0.1760326634, 0.1994711402, 0.1760326634, 0.1209853623, 0.06475879783};
    //Normalize coeffs
    float coeffs_sum = 0.9230247873f;
    for (int i = 0; i < 7; i++){
        coeffs[i] /= coeffs_sum;
    }

    // filter vertically
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0.0;
            for(int i = -3; i <= 3; i++){
                y1 = reflect101(src.rows, y - i);
                sum += coeffs[i + 3]*src.at<uchar>(y1, x);
            }
            temp.at<uchar>(y,x) = sum;
        }
    }

    // filter horizontally
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0.0;
            for(int i = -3; i <= 3; i++){
                x1 = reflect101(src.rows, x - i);
                sum += coeffs[i + 3]*temp.at<uchar>(y, x1);
            }
            dst.at<uchar>(y,x) = sum;
        }
    }

    return dst;
}

EDIT: there is probably a little error in the reflect101, their argument should be (src.cols, y+i) and (src.rows, x+i). — Alessandro Gaietta

sh1 sh1 · Accepted Answer · 2013-07-05T17:31:04

A big part of the problem, here, is that the algorithm is overly precise, as @PaulR pointed out. It's usually best to keep your coefficient table no more precise than your data. In this case, since you appear to be processing uchar data, you would use roughly an 8-bit coefficient table.

Keeping these weights small will particularly matter in your NEON implementation because the narrower you have the arithmetic, the more lanes you can process at once.

Beyond that, the first major slowdown that stands out is that having the image edge reflection code within the main loop. That's going to make the bulk of the work less efficient because it will generally not need to do anything special in that case.

It might work out better if you use a special version of the loop near the edges, and then when you're safe from that you use a simplified inner loop that doesn't call that reflect101() function.

Second (more relevant to prototype code) is that it's possible to add the wings of the window together before applying the weighting function, because the table contains the same coefficients on both sides.

sum = src.at<uchar>(y1, x) * coeffs[3];
for(int i = -3; i < 0; i++) {
    int tmp = src.at<uchar>(y + i, x) + src.at<uchar>(y - i, x);
    sum += coeffs[i + 3] * tmp;
}

This saves you six multiplies per pixel, and it's a step towards some other optimisations around controlling overflow conditions.

Then there are a couple of other problems related to the memory system.

The two-pass approach is good in principle, because it saves you from performing a lot of recomputation. Unfortunately it can push the useful data out of L1 cache, which can make everything quite a lot slower. It also means that when you write the result out to memory, you're quantising the intermediate sum, which can reduce precision.

When you convert this code to NEON, one of the things you will want to focus on is trying to keep your working set inside the register file, but without discarding calculations before they've been fully utilised.

When people do use two passes, it's usual for the intermediate data to be transposed -- that is, a column of input becomes a row of output.

This is because the CPU will really not like fetching small amounts of data across multiple lines of the input image. It works out much more efficient (because of the way the cache works) if you collect together a bunch of horizontal pixels, and filter those. If the temporary buffer is transposed, then the second pass also collects together a bunch of horizontal points (which would vertical in the original orientation) and it transposes its output again so it comes out the right way.

If you optimise to keep your working set localised, then you might not need this transposition trick, but it's worth knowing about so that you can set yourself a healthy baseline performance. Unfortunately, localisation like this does force you to go back to the non-optimal memory fetches, but with the wider data types that penalty can be mitigated.

Fast Gaussian Blur image filter with ARM NEON

4 Answers