Why is cuFFT so slow?

Question

I'm hoping to accelerate a computer vision application that computes many FFTs using FFTW and OpenMP on an Intel CPU. However, for a variety of FFT problem sizes, I've found that cuFFT is slower than FFTW with OpenMP.

In the experiments and discussion below, I find that cuFFT is slower than FFTW for batched 2D FFTs. Why is cuFFT so slow, and is there anything I can do to make cuFFT run faster?

Experiments (code download)

Our computer vision application requires a forward FFT on a bunch of small planes of size 256x256. I'm running the FFTs on on HOG features with a depth of 32, so I use the batch mode to do 32 FFTs per function call. Typically, I do about 8 FFT function calls of size 256x256 with a batch size of 32.

FFTW + OpenMP
The following code executes in 16.0ms on an Intel i7-2600 8-core CPU.

int depth = 32; int nRows = 256; int nCols = 256; int nIter = 8;
int n[2] = {nRows, nCols};

//if nCols is even, cols_padded = (nCols+2). if nCols is odd, cols_padded = (nCols+1)
int cols_padded = 2*(nCols/2 + 1); //allocate this width, but tell FFTW that it's nCols width
int inembed[2] = {nRows, 2*(nCols/2 + 1)};
int onembed[2] = {nRows, (nCols/2 + 1)}; //default -- equivalent ot onembed=NULL

float* h_in = (float*)malloc(sizeof(float)*nRows*cols_padded*depth);
memset(h_in, 0, sizeof(float)*nRows*cols_padded*depth);
fftwf_complex* h_freq = reinterpret_cast<fftwf_complex*>(h_in); //in-place version

fftwf_plan forwardPlan = fftwf_plan_many_dft_r2c(2, //rank
                                                 n, //dims -- this doesn't include zero-padding
                                                 depth, //howmany
                                                 h_in, //in
                                                 inembed, //inembed
                                                 depth, //istride
                                                 1, //idist
                                                 h_freq, //out
                                                 onembed, //onembed
                                                 depth, //ostride
                                                 1, //odist
                                                 FFTW_PATIENT /*flags*/);
double start = read_timer();
#pragma omp parallel for
for(int i=0; i<nIter; i++){
    fftwf_execute_dft_r2c(forwardPlan, h_in, h_freq);
}
double responseTime = read_timer() - start;
printf("did %d FFT calls in %f ms \n", nIter, responseTime);

cuFFT
The following code executes in 21.7ms on a top-of-the-line NVIDIA K20 GPU. Note that, even if I use streams, cuFFT does not run multiple FFTs concurrently.

int depth = 32; int nRows = 256; int nCols = 256; int nIter = 8;
int n[2] = {nRows, nCols};

int cols_padded = 2*(nCols/2 + 1); //allocate this width, but tell FFTW that it's nCols width
int inembed[2] = {nRows, 2*(nCols/2 + 1)};
int onembed[2] = {nRows, (nCols/2 + 1)}; //default -- equivalent ot onembed=NULL in FFTW
cufftHandle forwardPlan;
float* d_in; cufftComplex* d_freq;
CHECK_CUFFT(cufftPlanMany(&forwardPlan,
              2, //rank
              n, //dimensions = {nRows, nCols}
              inembed, //inembed
              depth, //istride
              1, //idist
              onembed, //onembed
              depth, //ostride
              1, //odist
              CUFFT_R2C, //cufftType
              depth /*batch*/));

CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth));
d_freq = reinterpret_cast<cufftComplex*>(d_in);

double start = read_timer();
for(int i=0; i<nIter; i++){

    CHECK_CUFFT(cufftExecR2C(forwardPlan, d_in, d_freq));
}
CHECK_CUDART(cudaDeviceSynchronize());
double responseTime = read_timer() - start;
printf("did %d FFT calls in %f ms \n", nIter, responseTime);

Other notes

In the GPU version, cudaMemcpys between the CPU and GPU are not included in my computation time.
The performance numbers presented here are averages of several experiments, where each experiment has 8 FFT function calls (total of 10 experiments, so 80 FFT function calls).
I've tried many problem sizes (e.g. 128x128, 256x256, 512x512, 1024x1024), all with depth=32. Based on the nvvp profiler, some sizes like 1024x1024 are able to fully saturate the GPU. But, for all of these sizes, the CPU FFTW+OpenMP is faster than cuFFT.

What happens if you have the omp version work on separate data areas, rather than all hammering on the same h_in and h_freq variables (which is probably not sensible, even for test purposes). And to some extent, it seems like you've answered your own question. For the 256x256 case, it seems you're stating the GPU is not fully utilized. You should probably test large cases. — Robert Crovella
Now, I'm trying a larger problem size: 1024x1024x32. This saturates the GPU, according to the NVIDIA Profiler (nvvp). Now, we're at 620ms for cuFFT and 487ms for FFTW. — solvingPuzzles
Doing an out-of-place transform (h_freq and h_in are different memory locations), FFTW's performance slows down from 16.0ms to 20.2ms average for 256x256x32. cuFFT's performance is unchanged for in-place vs. out-of-place, staying at 21.7ms. — solvingPuzzles
My point on h_in and h_freq is that in your real work, you're not going to do the same fft operation 8 times on the same data. What happens if you add a private directive to your omp parallel for to require each thread to work on a private copy of h_in and h_freq, or even better make 8 data sets h_in[0..7] and h_freq[0..7]. That seems more sensible to me for test purposes. Otherwise all 8 threads are working on the same data set (both input and output) which is kinda strange. — Robert Crovella
@solvingPuzzles If the GPU is not saturated (and at the sizes you are testing at I'd be more concerned about limitations in memory than performance), then just increase the batch size to accomodate the nIter as well. — Pavan Yalamanchili

Florent DUGUET Florent DUGUET · Accepted Answer · 2016-04-26T16:47:14

Question might be outdated, though here is a possible explanation (for the slowness of cuFFT).

When structuring your data for cufftPlanMany, the data arrangement is not very nice with the GPU. Indeed, using an istride and ostride of 32 means no data read is coalesced. See here for details on the read pattern

input[b * idist + (x * inembed[1] + y) * istride]
output[b * odist + (x * onembed[1] + y) * ostride]

in which case if i/ostride is 32, it will very unlikely be coalesced/optimal. (indeed b is the batch number). Here are the changes I applied:

    CHECK_CUFFT(cufftPlanMany(&forwardPlan,
              2, //rank
              n, //dimensions = {nRows, nCols}
              inembed, //inembed
              1,  // WAS: depth, //istride
              nRows*cols_padded, // WAS: 1, //idist
              onembed, //onembed
              1, // WAS: depth, //ostride
              nRows*cols_padded, // WAS:1, //odist
              CUFFT_R2C, //cufftType
              depth /*batch*/));

Running this, I entered a unspecified launch failure because of illegal memory access. You might want to change the memory allocation (cufftComplex is two floats, you need an x2 in your allocation size - looks like a typo).

// WAS : CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth)); 
CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth*2));

When running it this way, I got a x8 performance improvement on my card.

Why is cuFFT so slow?

Experiments (code download)

1 Answers