How do you implement the fastest possible Gaussian blur algorithm?
I am going to implement it in Java, so GPU solutions are ruled out. My application, planetGenesis, is cross platform, so I don't want JNI.
How do you implement the fastest possible Gaussian blur algorithm?
I am going to implement it in Java, so GPU solutions are ruled out. My application, planetGenesis, is cross platform, so I don't want JNI.
You should use the fact that a Gaussian kernel is separable, i. e. you can express a 2D convolution as a combination of two 1D convolutions.
If the filter is large, it may also make sense to use the fact that convolution in the spatial domain is equivalent to multiplication in the frequency (Fourier) domain. This means that you can take the Fourier transform of the image and the filter, multiply the (complex) results, and then take the inverse Fourier transform. The complexity of the FFT (Fast Fourier Transform) is O(n log n), while the complexity of a convolution is O(n^2). Also, if you need to blur many images with the same filter, you would only need to take the FFT of the filter once.
If you decide to go with using an FFT, the FFTW library is a good choice.
ULTIMATE SOLUTION
I was very confused by so much information and implementations, I didn't know which one should I trust. After I figured it out, I decided to write my own article. I hope it will save you hours of time.
Fastest Gaussian Blur (in linear time)
It contains the source code, which (I hope) is short, clean and easily rewritable to any other language. Please vote it up, so other people can see it.
I found Quasimondo : Incubator : Processing : Fast Gaussian Blur. This method contains a lot of approximations like using integers and look up tables instead of floats and floating point divisions. I don't know how much speedup that is in modern Java code.
Fast Shadows on Rectangles has an approximating algorithm using B-splines.
Fast Gaussian Blur Algorithm in C# claims to have some cool optimizations.
Also, Fast Gaussian Blur (PDF) by David Everly has a fast method for Gaussian blur processing.
I would try out the various methods, benchmark them and post the results here.
For my purposes, I have copied and implemented the basic (process X-Y axis independently) method and David Everly's Fast Gaussian Blur method from the Internet. They differ in parameters, so I couldn't compare them directly. However the latter goes through much fewer number of iterations for a large blur radius. Also, the latter is an approximate algorithm.
You probably want the box blur, which is much faster. See this link for a great tutorial and some copy & paste C code.
For larger blur radiuses, try applying a box blur three times. This will approximate a Gaussian blur very well, and be much faster than a true Gaussian blur.
I have converted Ivan Kuckir's implementation of a fast Gaussian blur which uses three passes with linear box blurs to java. The resulting process is O(n) as he has stated at his own blog. If you would like to learn more about why does 3 time box blur approximates to Gaussian blur(3%) my friend you may check out box blur and Gaussian blur.
Here is the java implementation.
@Override
public BufferedImage ProcessImage(BufferedImage image) {
int width = image.getWidth();
int height = image.getHeight();
int[] pixels = image.getRGB(0, 0, width, height, null, 0, width);
int[] changedPixels = new int[pixels.length];
FastGaussianBlur(pixels, changedPixels, width, height, 12);
BufferedImage newImage = new BufferedImage(width, height, image.getType());
newImage.setRGB(0, 0, width, height, changedPixels, 0, width);
return newImage;
}
private void FastGaussianBlur(int[] source, int[] output, int width, int height, int radius) {
ArrayList<Integer> gaussianBoxes = CreateGausianBoxes(radius, 3);
BoxBlur(source, output, width, height, (gaussianBoxes.get(0) - 1) / 2);
BoxBlur(output, source, width, height, (gaussianBoxes.get(1) - 1) / 2);
BoxBlur(source, output, width, height, (gaussianBoxes.get(2) - 1) / 2);
}
private ArrayList<Integer> CreateGausianBoxes(double sigma, int n) {
double idealFilterWidth = Math.sqrt((12 * sigma * sigma / n) + 1);
int filterWidth = (int) Math.floor(idealFilterWidth);
if (filterWidth % 2 == 0) {
filterWidth--;
}
int filterWidthU = filterWidth + 2;
double mIdeal = (12 * sigma * sigma - n * filterWidth * filterWidth - 4 * n * filterWidth - 3 * n) / (-4 * filterWidth - 4);
double m = Math.round(mIdeal);
ArrayList<Integer> result = new ArrayList<>();
for (int i = 0; i < n; i++) {
result.add(i < m ? filterWidth : filterWidthU);
}
return result;
}
private void BoxBlur(int[] source, int[] output, int width, int height, int radius) {
System.arraycopy(source, 0, output, 0, source.length);
BoxBlurHorizantal(output, source, width, height, radius);
BoxBlurVertical(source, output, width, height, radius);
}
private void BoxBlurHorizontal(int[] sourcePixels, int[] outputPixels, int width, int height, int radius) {
int resultingColorPixel;
float iarr = 1f / (radius + radius);
for (int i = 0; i < height; i++) {
int outputIndex = i * width;
int li = outputIndex;
int sourceIndex = outputIndex + radius;
int fv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex]);
int lv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex + width - 1]);
float val = (radius) * fv;
for (int j = 0; j < radius; j++) {
val += Byte.toUnsignedInt((byte) (sourcePixels[outputIndex + j]));
}
for (int j = 0; j < radius; j++) {
val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex++]) - fv;
resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
}
for (int j = (radius + 1); j < (width - radius); j++) {
val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex++]) - Byte.toUnsignedInt((byte) sourcePixels[li++]);
resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
}
for (int j = (width - radius); j < width; j++) {
val += lv - Byte.toUnsignedInt((byte) sourcePixels[li++]);
resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
}
}
}
private void BoxBlurVertical(int[] sourcePixels, int[] outputPixels, int width, int height, int radius) {
int resultingColorPixel;
float iarr = 1f / (radius + radius + 1);
for (int i = 0; i < width; i++) {
int outputIndex = i;
int li = outputIndex;
int sourceIndex = outputIndex + radius * width;
int fv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex]);
int lv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex + width * (height - 1)]);
float val = (radius + 1) * fv;
for (int j = 0; j < radius; j++) {
val += Byte.toUnsignedInt((byte) sourcePixels[outputIndex + j * width]);
}
for (int j = 0; j <= radius; j++) {
val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex]) - fv;
resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
sourceIndex += width;
outputIndex += width;
}
for (int j = radius + 1; j < (height - radius); j++) {
val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex]) - Byte.toUnsignedInt((byte) sourcePixels[li]);
resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
li += width;
sourceIndex += width;
outputIndex += width;
}
for (int j = (height - radius); j < height; j++) {
val += lv - Byte.toUnsignedInt((byte) sourcePixels[li]);
resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
li += width;
outputIndex += width;
}
}
}
It is best done on small blocks, as a full-image transpose is slow, while a small-block transpose can be done extremely fast using a chain of PUNPCKs (PUNPCKHBW, PUNPCKHDQ, PUNPCKHWD, PUNPCKLBW, PUNPCKLDQ, PUNPCKLWD).
In 1D:
Blurring using almost any kernel repeatedly will tend to a Gaussian kernel. This is what's so cool about the Gaussian distribution, and is why statisticians like it. So choose something that's easy to blur with and apply it several times.
For example, it's easy to blur with a box shaped kernel. First calculate a cumulative sum:
y(i) = y(i-1) + x(i)
then:
blurred(i) = y(i+radius) - y(i-radius)
Repeat several times.
Or you might go back and forth a few times with some variety of an IIR filter, these are similarly fast.
In 2D or higher:
Blur in each dimension one after the other, as DarenW said.
I struggled with this problem for my research and tried and interesting method for a fast Gaussian blur. First, as mentioned, it is best to separate the blur into two 1D blurs, but depending on your hardware for the actual calculation of the pixel values, you can actually pre-compute all possible values and store them in a look-up table.
In other words, pre-calculate every combination of Gaussian coefficient
* input pixel value
. Of course you will need to discreetize your coefficients, but I just wanted to add this solution. If you have an IEEE subscription, you can read more in Fast image blurring using Lookup Table for real time feature extraction.
Ultimately, I ended up using CUDA though :)
There are several fast methods for gauss blur of 2d data. What you should know about.
Your choice is depend from required speed, precision, and implementation complexity.
Try using Box Blur the way I did here: Approximating Gaussian Blur Using Extended Box Blur
This is the best approximation.
Using Integral Images you can make it even faster.
If you do, Please share your solution.
Answering this old question with new libraries that have been implemented now(as of 2016), because there are many new advancements to the GPU technology with Java.
As suggested in few other answers, CUDA is an alternative. But java has CUDA support now.
IBM CUDA4J library: provides a Java API for managing and accessing GPU devices, libraries, kernels, and memory. Using these new APIs it is possible to write Java programs that manage GPU device characteristics and offload work to the GPU with the convenience of the Java memory model, exceptions, and automatic resource management.
Jcuda: Java bindings for NVIDIA CUDA and related libraries. With JCuda it is possible to interact with the CUDA runtime and driver API from Java programs.
Aparapi: allows Java developers to take advantage of the compute power of GPU and APU devices by executing data parallel code fragments on the GPU rather than being confined to the local CPU.
Some Java OpenCL binding libraries
https://github.com/ochafik/JavaCL : Java bindings for OpenCL: An object-oriented OpenCL library, based on auto-generated low-level bindings
http://jogamp.org/jocl/www/ : Java bindings for OpenCL: An object-oriented OpenCL library, based on auto-generated low-level bindings
http://www.lwjgl.org/ : Java bindings for OpenCL: Auto-generated low-level bindings and object-oriented convenience classes
http://jocl.org/ : Java bindings for OpenCL: Low-level bindings that are a 1:1 mapping of the original OpenCL API
All these above libraries will help in implementing Gaussian Blur faster than any implementation in Java on CPU.
I have seen several answers in different places and am collecting them here so that I can try to wrap my mind around them and remember them for later:
Regardless of which approach you use, filter horizontal and vertical dimensions separately with 1D filters rather than using a single square filter.
After reviewing all of these, I'm reminded that simple, poor approximations often work well in practice. In a different field, Alex Krizhevsky found ReLU's to be faster than the classic sigmoid function in his ground-breaking AlexNet, even though they appear at first glance to be a terrible approximation to the Sigmoid.
Dave Hale from CWP has a minejtk package, which includes recursive Gaussian filter (Deriche method and Van Vliet method). The java subroutine can be found at https://github.com/dhale/jtk/blob/0350c23f91256181d415ea7369dbd62855ac4460/core/src/main/java/edu/mines/jtk/dsp/RecursiveGaussianFilter.java
Deriche's method seems to be a very good one for Gaussian blur (and also for Gaussian's derivatives).