On my application I need to transform each line of an image, apply a filter and transform it back.
I want to be able to make multiple FFT at the same time using the GPU. More precisely, I'm using NVIDIA's CUDA. Now, some considerations:
- CUDA's FFT library, CUFFT is only able to make calls from the host ( https://devtalk.nvidia.com/default/topic/523177/cufft-device-callable-library/).
- On this topic (running FFTW on GPU vs using CUFFT), Robert Corvella says
"cufft routines can be called by multiple host threads".
I believed that doing all this FFTs in parallel would increase performance, but Robert comments
"the FFT operations are of reasonably large size, then just calling the cufft library routines as indicated should give you good speedup and approximately fully utilize the machine"
So, Is this it? Is there no gain in performing more than one FFT at a time?
Is there any library that supports calls from the device?
Shoud I just use cufftPlanMany() instead (as refered in "is-there-a-method-of-fft-that-will-run-inside-cuda-kernel" by hang or as referred in the previous topic, by Robert)?
Or the best option is to call mutiple host threads?
(this 2 links limit is killing me...)
My objective is to get some discussion on what's the best solution to this problem, since many have faced similar situations. This might be obsolete once NVIDIA implements device calls on CUFFT. (something they said they are working on but there is no expected date for the release - something said on the discussion at the NVIDIA forum (first link))