Fastest Gaussian blur implementation

36

votes

How do you implement the fastest possible Gaussian blur algorithm?

I am going to implement it in Java, so GPU solutions are ruled out. My application, planetGenesis, is cross platform, so I don't want JNI.

javaimage-processingfilteringgaussian

Worth taking a look at (or using directly!) the GaussianFilter at JH Labs - jhlabs.com/ip/filters/index.html - I've used it and it is pretty fast. – mikera

Have a look here: github.com/RoyiAvital/FastGuassianBlur – Royi

31

votes

You should use the fact that a Gaussian kernel is separable, i. e. you can express a 2D convolution as a combination of two 1D convolutions.

If the filter is large, it may also make sense to use the fact that convolution in the spatial domain is equivalent to multiplication in the frequency (Fourier) domain. This means that you can take the Fourier transform of the image and the filter, multiply the (complex) results, and then take the inverse Fourier transform. The complexity of the FFT (Fast Fourier Transform) is O(n log n), while the complexity of a convolution is O(n^2). Also, if you need to blur many images with the same filter, you would only need to take the FFT of the filter once.

If you decide to go with using an FFT, the FFTW library is a good choice.

27

votes

Math jocks are likely to know this, but for anyone else..

Due to a nice mathematical propertiy of the Gaussian, you can blur a 2D image quickly by first running a 1D Gaussian blur on each row of the image, then run a 1D blur on each column.

20

votes

ULTIMATE SOLUTION

I was very confused by so much information and implementations, I didn't know which one should I trust. After I figured it out, I decided to write my own article. I hope it will save you hours of time.

Fastest Gaussian Blur (in linear time)

It contains the source code, which (I hope) is short, clean and easily rewritable to any other language. Please vote it up, so other people can see it.

17

votes

I found Quasimondo : Incubator : Processing : Fast Gaussian Blur. This method contains a lot of approximations like using integers and look up tables instead of floats and floating point divisions. I don't know how much speedup that is in modern Java code.
Fast Shadows on Rectangles has an approximating algorithm using B-splines.
Fast Gaussian Blur Algorithm in C# claims to have some cool optimizations.
Also, Fast Gaussian Blur (PDF) by David Everly has a fast method for Gaussian blur processing.

I would try out the various methods, benchmark them and post the results here.

For my purposes, I have copied and implemented the basic (process X-Y axis independently) method and David Everly's Fast Gaussian Blur method from the Internet. They differ in parameters, so I couldn't compare them directly. However the latter goes through much fewer number of iterations for a large blur radius. Also, the latter is an approximate algorithm.

8

votes

You probably want the box blur, which is much faster. See this link for a great tutorial and some copy & paste C code.

7

votes

For larger blur radiuses, try applying a box blur three times. This will approximate a Gaussian blur very well, and be much faster than a true Gaussian blur.

3

votes

I have converted Ivan Kuckir's implementation of a fast Gaussian blur which uses three passes with linear box blurs to java. The resulting process is O(n) as he has stated at his own blog. If you would like to learn more about why does 3 time box blur approximates to Gaussian blur(3%) my friend you may check out box blur and Gaussian blur.

Here is the java implementation.

@Override
public BufferedImage ProcessImage(BufferedImage image) {
    int width = image.getWidth();
    int height = image.getHeight();

    int[] pixels = image.getRGB(0, 0, width, height, null, 0, width);
    int[] changedPixels = new int[pixels.length];

    FastGaussianBlur(pixels, changedPixels, width, height, 12);

    BufferedImage newImage = new BufferedImage(width, height, image.getType());
    newImage.setRGB(0, 0, width, height, changedPixels, 0, width);

    return newImage;
}

private void FastGaussianBlur(int[] source, int[] output, int width, int height, int radius) {
    ArrayList<Integer> gaussianBoxes = CreateGausianBoxes(radius, 3);
    BoxBlur(source, output, width, height, (gaussianBoxes.get(0) - 1) / 2);
    BoxBlur(output, source, width, height, (gaussianBoxes.get(1) - 1) / 2);
    BoxBlur(source, output, width, height, (gaussianBoxes.get(2) - 1) / 2);
}

private ArrayList<Integer> CreateGausianBoxes(double sigma, int n) {
    double idealFilterWidth = Math.sqrt((12 * sigma * sigma / n) + 1);

    int filterWidth = (int) Math.floor(idealFilterWidth);

    if (filterWidth % 2 == 0) {
        filterWidth--;
    }

    int filterWidthU = filterWidth + 2;

    double mIdeal = (12 * sigma * sigma - n * filterWidth * filterWidth - 4 * n * filterWidth - 3 * n) / (-4 * filterWidth - 4);
    double m = Math.round(mIdeal);

    ArrayList<Integer> result = new ArrayList<>();

    for (int i = 0; i < n; i++) {
        result.add(i < m ? filterWidth : filterWidthU);
    }

    return result;
}

private void BoxBlur(int[] source, int[] output, int width, int height, int radius) {
    System.arraycopy(source, 0, output, 0, source.length);
    BoxBlurHorizantal(output, source, width, height, radius);
    BoxBlurVertical(source, output, width, height, radius);
}

private void BoxBlurHorizontal(int[] sourcePixels, int[] outputPixels, int width, int height, int radius) {
    int resultingColorPixel;
    float iarr = 1f / (radius + radius);
    for (int i = 0; i < height; i++) {
        int outputIndex = i * width;
        int li = outputIndex;
        int sourceIndex = outputIndex + radius;

        int fv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex]);
        int lv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex + width - 1]);
        float val = (radius) * fv;

        for (int j = 0; j < radius; j++) {
            val += Byte.toUnsignedInt((byte) (sourcePixels[outputIndex + j]));
        }

        for (int j = 0; j < radius; j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex++]) - fv;
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
        }

        for (int j = (radius + 1); j < (width - radius); j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex++]) - Byte.toUnsignedInt((byte) sourcePixels[li++]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
        }

        for (int j = (width - radius); j < width; j++) {
            val += lv - Byte.toUnsignedInt((byte) sourcePixels[li++]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
        }
    }
}

private void BoxBlurVertical(int[] sourcePixels, int[] outputPixels, int width, int height, int radius) {
    int resultingColorPixel;
    float iarr = 1f / (radius + radius + 1);
    for (int i = 0; i < width; i++) {
        int outputIndex = i;
        int li = outputIndex;
        int sourceIndex = outputIndex + radius * width;

        int fv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex]);
        int lv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex + width * (height - 1)]);
        float val = (radius + 1) * fv;

        for (int j = 0; j < radius; j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[outputIndex + j * width]);
        }
        for (int j = 0; j <= radius; j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex]) - fv;
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
            sourceIndex += width;
            outputIndex += width;
        }
        for (int j = radius + 1; j < (height - radius); j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex]) - Byte.toUnsignedInt((byte) sourcePixels[li]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
            li += width;
            sourceIndex += width;
            outputIndex += width;
        }
        for (int j = (height - radius); j < height; j++) {
            val += lv - Byte.toUnsignedInt((byte) sourcePixels[li]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
            li += width;
            outputIndex += width;
        }
    }
}

2

votes

I would consider using CUDA or some other GPU programming toolkit for this, especially if you want to use a larger kernel. Failing that, there's always hand tweaking your loops in assembly.

2

votes

Step 1: SIMD 1-dimensional Gaussian blur
Step 2: transpose
Step 3: Repeat step 1

It is best done on small blocks, as a full-image transpose is slow, while a small-block transpose can be done extremely fast using a chain of PUNPCKs (PUNPCKHBW, PUNPCKHDQ, PUNPCKHWD, PUNPCKLBW, PUNPCKLDQ, PUNPCKLWD).

2

votes

In 1D:

Blurring using almost any kernel repeatedly will tend to a Gaussian kernel. This is what's so cool about the Gaussian distribution, and is why statisticians like it. So choose something that's easy to blur with and apply it several times.

For example, it's easy to blur with a box shaped kernel. First calculate a cumulative sum:

y(i) = y(i-1) + x(i)

then:

blurred(i) = y(i+radius) - y(i-radius)

Repeat several times.

Or you might go back and forth a few times with some variety of an IIR filter, these are similarly fast.

In 2D or higher:

Blur in each dimension one after the other, as DarenW said.

2

votes

I struggled with this problem for my research and tried and interesting method for a fast Gaussian blur. First, as mentioned, it is best to separate the blur into two 1D blurs, but depending on your hardware for the actual calculation of the pixel values, you can actually pre-compute all possible values and store them in a look-up table.

In other words, pre-calculate every combination of Gaussian coefficient * input pixel value. Of course you will need to discreetize your coefficients, but I just wanted to add this solution. If you have an IEEE subscription, you can read more in Fast image blurring using Lookup Table for real time feature extraction.

Ultimately, I ended up using CUDA though :)

0

votes

There are several fast methods for gauss blur of 2d data. What you should know about.

This is separable filter , so only require two 1d convolution.
For big kernels you can process reduced copy of image and than upscale back.
Good approximation can be done by multiple box filters (also separable), (can be tuned number of iterations and kernel sizes)
Exist O(n) complexity algorithm (for any kernel size) for precise gauss approximation by IIR filter.

Your choice is depend from required speed, precision, and implementation complexity.

0

votes

Try using Box Blur the way I did here: Approximating Gaussian Blur Using Extended Box Blur

This is the best approximation.

Using Integral Images you can make it even faster.
If you do, Please share your solution.

0

votes

Answering this old question with new libraries that have been implemented now(as of 2016), because there are many new advancements to the GPU technology with Java.

As suggested in few other answers, CUDA is an alternative. But java has CUDA support now.

IBM CUDA4J library: provides a Java API for managing and accessing GPU devices, libraries, kernels, and memory. Using these new APIs it is possible to write Java programs that manage GPU device characteristics and offload work to the GPU with the convenience of the Java memory model, exceptions, and automatic resource management.

Jcuda: Java bindings for NVIDIA CUDA and related libraries. With JCuda it is possible to interact with the CUDA runtime and driver API from Java programs.

Aparapi: allows Java developers to take advantage of the compute power of GPU and APU devices by executing data parallel code fragments on the GPU rather than being confined to the local CPU.

Some Java OpenCL binding libraries

https://github.com/ochafik/JavaCL : Java bindings for OpenCL: An object-oriented OpenCL library, based on auto-generated low-level bindings

http://jogamp.org/jocl/www/ : Java bindings for OpenCL: An object-oriented OpenCL library, based on auto-generated low-level bindings

http://www.lwjgl.org/ : Java bindings for OpenCL: Auto-generated low-level bindings and object-oriented convenience classes

http://jocl.org/ : Java bindings for OpenCL: Low-level bindings that are a 1:1 mapping of the original OpenCL API

All these above libraries will help in implementing Gaussian Blur faster than any implementation in Java on CPU.

0

votes

I have seen several answers in different places and am collecting them here so that I can try to wrap my mind around them and remember them for later:

Regardless of which approach you use, filter horizontal and vertical dimensions separately with 1D filters rather than using a single square filter.

The standard "slow" approach: convolution filter
Hierarchical pyramid of images with reduced resolution as in SIFT
Repeated box blurs motivated by the Central Limit Theorem. The Box Blur is central to Viola and Jones's face detection where they call it an integral image if I remember correctly. I think Haar-like features use something similar, too.
Stack Blur: a queue-based alternative somewhere between the convolutional and box blur approaches
IIR filters
- Derich filter (Wikipedia) 2nd order IIR filter
- van Vliet filter I know nothing about this one
- Bessel filters although there is some debate about these

After reviewing all of these, I'm reminded that simple, poor approximations often work well in practice. In a different field, Alex Krizhevsky found ReLU's to be faster than the classic sigmoid function in his ground-breaking AlexNet, even though they appear at first glance to be a terrible approximation to the Sigmoid.

0

votes

Dave Hale from CWP has a minejtk package, which includes recursive Gaussian filter (Deriche method and Van Vliet method). The java subroutine can be found at https://github.com/dhale/jtk/blob/0350c23f91256181d415ea7369dbd62855ac4460/core/src/main/java/edu/mines/jtk/dsp/RecursiveGaussianFilter.java

Deriche's method seems to be a very good one for Gaussian blur (and also for Gaussian's derivatives).

Fastest Gaussian blur implementation

16 Answers