I'm trying to perform a 2D convolution using the "FFT + pointwise product + iFFT" approach. With NxN matrices the method works well; however, with non-square matrices the results are not correct. I've read the whole cuFFT documentation looking for any note about the behavior with this kind of matrix, and tested both in-place and out-of-place FFTs, but I must be missing something.
I've tested the same algorithm with the same matrices in MATLAB and everything is correct. Below is a very simplified version of the code, with a very basic filter for clarity, its output, and the expected output. What am I doing wrong? I've also read other related questions/answers, but none of them solve the problem. Thank you very much in advance.
const int W = 5;
const int H = 4;
//Signal, with just one value, for simplicity.
float A[H][W] = {{0, 0, 0, 0, 0},
                 {0, 1, 0, 0, 0},
                 {0, 0, 0, 0, 0},
                 {0, 0, 0, 0, 0}};
// Central element of the kernel in the (0,0) position of the array.
float B[H][W] = {{0.5, 0.1, 0, 0, 0.2},
                 {0  , 0  , 0, 0, 0  },
                 {0  , 0  , 0, 0, 0  },
                 {0  , 0  , 0, 0, 0  }};
cufftReal* d_inA, *d_inB;
cufftComplex* d_outA, *d_outB;
size_t real_size = W * H * sizeof(cufftReal);
size_t complex_size = W * (H/2+1) * sizeof(cufftComplex);
cudaMalloc((void**)&d_inA, real_size);
cudaMalloc((void**)&d_inB, real_size);
cudaMalloc((void**)&d_outA, complex_size);
cudaMalloc((void**)&d_outB, complex_size);
cudaMemset(d_inA,0,real_size);
cudaMemset(d_inB,0,real_size);
cudaMemcpy(d_inA, A, real_size, cudaMemcpyHostToDevice);
cudaMemcpy(d_inB, B, real_size, cudaMemcpyHostToDevice);
cufftHandle fwplanA, fwplanB, bwplan;
cufftPlan2d(&fwplanA, W, H, CUFFT_R2C);
cufftPlan2d(&fwplanB, W, H, CUFFT_R2C);
cufftPlan2d(&bwplan, W, H, CUFFT_C2R);
cufftSetCompatibilityMode(fwplanA,CUFFT_COMPATIBILITY_NATIVE);
cufftSetCompatibilityMode(fwplanB,CUFFT_COMPATIBILITY_NATIVE);
cufftSetCompatibilityMode(bwplan,CUFFT_COMPATIBILITY_NATIVE);
cufftExecR2C(fwplanA, d_inA, d_outA);
cufftExecR2C(fwplanB, d_inB, d_outB);
int blocksx = ceil((W * (H/2+1)) / 256.0f);
dim3 threads(256);
dim3 grid(blocksx);
// One complex product for each thread, scaled by the inverse of the
// number of elements involved in the FFT
pointwise_product<<<grid, threads>>>(d_outA, d_outB, (W*(H/2+1)), 1.0f/(W*H));
cufftExecC2R(bwplan, d_outA, d_inA);
cufftReal* result = new cufftReal[W*2*(H/2+1)];
cudaMemcpy(result, d_inA, real_size,cudaMemcpyDeviceToHost);
// Print result...
// Free memory...
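For completeness, the pointwise_product kernel is omitted above; a minimal sketch of such a kernel, consistent with the launch configuration (one scaled complex product per thread), could look like this:

```cuda
// Sketch of a pointwise_product kernel matching the launch above:
// one complex multiply per thread, scaled by 1/(W*H).
__global__ void pointwise_product(cufftComplex* a, const cufftComplex* b,
                                  int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cufftComplex t = a[i];
        // (t.x + i*t.y) * (b[i].x + i*b[i].y), then scale
        a[i].x = (t.x * b[i].x - t.y * b[i].y) * scale;
        a[i].y = (t.x * b[i].y + t.y * b[i].x) * scale;
    }
}
```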
Output (note the displaced value):
-0.0 0.0 -0.0 -0.0 0.0
0.0 0.5 0.1 0.0 -0.0
0.2 0.0 0.0 -0.0 -0.0
-0.0 0.0 -0.0 -0.0 -0.0
Expected output (MATLAB):
0 0 0 0 0
0.2000 0.5000 0.1000 0.0000 0.0000
0 0 0 0 0
0 0 0 0 0
Related Doc: http://developer.download.nvidia.com/compute/cuda/2_2/sdk/website/projects/convolutionFFT2D/doc/convolutionFFT2D.pdf
cufftPlan2d(cufftHandle *plan, int nx, int ny, cufftType type)
where nx is the number of rows and ny is the number of columns, so it should be cufftPlan2d(&fwplanA, H, W, CUFFT_R2C); and not cufftPlan2d(&fwplanA, W, H, CUFFT_R2C);. – Vitality
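Applying that fix, a sketch of the corrected calls follows. Note that the R2C output then holds H * (W/2 + 1) complex elements, so the complex buffer size and the kernel launch must change accordingly:

```cuda
// Corrected: cufftPlan2d takes (nx, ny) = (rows, cols) for row-major data.
size_t real_size    = W * H * sizeof(cufftReal);
size_t complex_size = H * (W/2 + 1) * sizeof(cufftComplex); // R2C output size

cufftPlan2d(&fwplanA, H, W, CUFFT_R2C);
cufftPlan2d(&fwplanB, H, W, CUFFT_R2C);
cufftPlan2d(&bwplan,  H, W, CUFFT_C2R);

// The pointwise product now runs over H*(W/2+1) complex elements.
int n = H * (W/2 + 1);
int blocksx = (n + 255) / 256;
pointwise_product<<<blocksx, 256>>>(d_outA, d_outB, n, 1.0f / (W * H));
```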