I'm trying to perform a 2D convolution using the "FFT + pointwise product + iFFT" approach. With NxN matrices the method works well; however, with non-square matrices the results are not correct. I've read the whole cuFFT documentation looking for any note about the behavior with this kind of matrix, and tested both in-place and out-of-place FFTs, but I must be missing something.
I've tested the same algorithm with the same matrices in MATLAB and everything is correct. Below is a very simplified version of the code, with a very basic filter for clarity, its output, and the expected output. What am I doing wrong? I've also read other related questions/answers, but none of them solve the problem. Thank you very much in advance.
const int W = 5;
const int H = 4;
//Signal, with just one value, for simplicity.
float A[H][W] = {{0, 0, 0, 0, 0},
                 {0, 1, 0, 0, 0},
                 {0, 0, 0, 0, 0},
                 {0, 0, 0, 0, 0}};
// Central element of the kernel in the (0,0) position of the array.
float B[H][W] = {{0.5, 0.1, 0, 0, 0.2},
                 {0  , 0  , 0, 0, 0  },
                 {0  , 0  , 0, 0, 0  },
                 {0  , 0  , 0, 0, 0  }};
cufftReal* d_inA, *d_inB;
cufftComplex* d_outA, *d_outB;
size_t real_size = W * H * sizeof(cufftReal);
size_t complex_size = W * (H/2+1) * sizeof(cufftComplex);
cudaMalloc((void**)&d_inA, real_size);
cudaMalloc((void**)&d_inB, real_size);
cudaMalloc((void**)&d_outA, complex_size);
cudaMalloc((void**)&d_outB, complex_size);
cudaMemset(d_inA,0,real_size);
cudaMemset(d_inB,0,real_size);
cudaMemcpy(d_inA, A, real_size, cudaMemcpyHostToDevice);
cudaMemcpy(d_inB, B, real_size, cudaMemcpyHostToDevice);
cufftHandle fwplanA, fwplanB, bwplan;
cufftPlan2d(&fwplanA, W, H, CUFFT_R2C);
cufftPlan2d(&fwplanB, W, H, CUFFT_R2C);
cufftPlan2d(&bwplan, W, H, CUFFT_C2R);
cufftSetCompatibilityMode(fwplanA,CUFFT_COMPATIBILITY_NATIVE);
cufftSetCompatibilityMode(fwplanB,CUFFT_COMPATIBILITY_NATIVE);
cufftSetCompatibilityMode(bwplan,CUFFT_COMPATIBILITY_NATIVE);
cufftExecR2C(fwplanA, d_inA, d_outA);
cufftExecR2C(fwplanB, d_inB, d_outB);
int blocksx = ceil((W * (H/2+1)) / 256.0f);
dim3 threads(256);
dim3 grid(blocksx);
// One complex product for each thread, scaled by the inverse of the
// number of elements involved in the FFT
pointwise_product<<<grid, threads>>>(d_outA, d_outB, (W*(H/2+1)), 1.0f/(W*H));
cufftExecC2R(bwplan, d_outA, d_inA);
cufftReal* result = new cufftReal[W*2*(H/2+1)];
cudaMemcpy(result, d_inA, real_size,cudaMemcpyDeviceToHost);
// Print result...
// Free memory...
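For completeness, the pointwise_product kernel is omitted above; a minimal sketch of such a kernel, consistent with the launch configuration (one scaled complex product per thread), could look like this:

```cuda
// Sketch of a pointwise_product kernel matching the launch above:
// one complex multiply per thread, scaled by 1/(W*H).
__global__ void pointwise_product(cufftComplex* a, const cufftComplex* b,
                                  int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cufftComplex t = a[i];
        // (t.x + i*t.y) * (b[i].x + i*b[i].y), then scale
        a[i].x = (t.x * b[i].x - t.y * b[i].y) * scale;
        a[i].y = (t.x * b[i].y + t.y * b[i].x) * scale;
    }
}
```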
Output (note the displaced value):
-0.0 0.0 -0.0 -0.0 0.0
0.0 0.5 0.1 0.0 -0.0
0.2 0.0 0.0 -0.0 -0.0
-0.0 0.0 -0.0 -0.0 -0.0
Expected output (MATLAB):
0 0 0 0 0
0.2000 0.5000 0.1000 0.0000 0.0000
0 0 0 0 0
0 0 0 0 0
Related Doc: http://developer.download.nvidia.com/compute/cuda/2_2/sdk/website/projects/convolutionFFT2D/doc/convolutionFFT2D.pdf
cufftPlan2d(cufftHandle *plan, int nx, int ny, cufftType type)
where nx is the number of rows and ny is the number of columns, so it should be cufftPlan2d(&fwplanA, H, W, CUFFT_R2C); and not cufftPlan2d(&fwplanA, W, H, CUFFT_R2C);. – Vitality
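Applying that fix, a sketch of the corrected calls follows. Note that the R2C output then holds H * (W/2 + 1) complex elements, so the complex buffer size and the kernel launch must change accordingly:

```cuda
// Corrected: cufftPlan2d takes (nx, ny) = (rows, cols) for row-major data.
size_t real_size    = W * H * sizeof(cufftReal);
size_t complex_size = H * (W/2 + 1) * sizeof(cufftComplex); // R2C output size

cufftPlan2d(&fwplanA, H, W, CUFFT_R2C);
cufftPlan2d(&fwplanB, H, W, CUFFT_R2C);
cufftPlan2d(&bwplan,  H, W, CUFFT_C2R);

// The pointwise product now runs over H*(W/2+1) complex elements.
int n = H * (W/2 + 1);
int blocksx = (n + 255) / 256;
pointwise_product<<<blocksx, 256>>>(d_outA, d_outB, n, 1.0f / (W * H));
```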