Is it possible to use the Dynamic Parallelism feature on devices with compute capability lower than 3.5?
I know that nvcc returns an error if it encounters a kernel launch inside a kernel, but is it possible to declare some PTX-level function to support this feature?
For example, here I have declared cudaLaunchDevice():
.extern .func(.param .b32 func_retval0) cudaLaunchDevice
(
    .param .b32 func,
    .param .b32 parameterBuffer,
    .param .align 4 .b8 gridDimension[12],
    .param .align 4 .b8 blockDimension[12],
    .param .b32 sharedMemSize,
    .param .b32 stream
)
;
My hope is that the CUDA-level declaration of this function in cuda_device_runtime_api.h will be mapped to the PTX-level declaration above. (Does the program have to be linked against something?)
// CUDA-level declaration of cudaLaunchDevice()
extern "C" __device__
cudaError_t cudaLaunchDevice(void *func, void *parameterBuffer,
                             dim3 gridDimension, dim3 blockDimension,
                             unsigned int sharedMemSize,
                             cudaStream_t stream);
(The code is from the CUDA C Programming Guide v6.5.)
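For context, the kind of nested launch that nvcc rejects below compute capability 3.5 can be sketched roughly as follows. This is only an illustration, not code from the guide: the kernel names `parent` and `child`, the helper `supportsDynamicParallelism`, and the file name are hypothetical, and the build line in the comment assumes a toolkit that still accepts `sm_35`. As far as I understand, on supported devices a program like this is compiled as relocatable device code and linked against the device runtime library (cudadevrt).

```cuda
// nested_launch.cu (hypothetical file name)
// Assumed build line for a toolkit that still supports sm_35:
//   nvcc -arch=sm_35 -rdc=true nested_launch.cu -lcudadevrt
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical child kernel: each thread writes its own index.
__global__ void child(int *out)
{
    out[threadIdx.x] = threadIdx.x;
}

// Hypothetical parent kernel: launches the child from device code.
// The guard removes the nested launch when compiling device code
// for architectures below sm_35, where it is not supported.
__global__ void parent(int *out)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 350
    if (threadIdx.x == 0) {
        child<<<1, 4>>>(out);
    }
#endif
}

// Hypothetical helper: true when the compute capability is 3.5 or higher.
static bool supportsDynamicParallelism(int major, int minor)
{
    return major > 3 || (major == 3 && minor >= 5);
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("No usable CUDA device found\n");
        return 1;
    }
    if (!supportsDynamicParallelism(prop.major, prop.minor)) {
        std::printf("Compute capability %d.%d: nested launch skipped\n",
                    prop.major, prop.minor);
        return 0;
    }

    int *d_out = nullptr;
    cudaMalloc(&d_out, 4 * sizeof(int));
    cudaMemset(d_out, 0, 4 * sizeof(int));

    parent<<<1, 1>>>(d_out);
    // The parent grid is not complete until its child grids finish,
    // so a host-side synchronize covers both.
    cudaDeviceSynchronize();

    int h_out[4];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 4; ++i) {
        std::printf("out[%d] = %d\n", i, h_out[i]);
    }
    cudaFree(d_out);
    return 0;
}
```

The host-side check only reports whether the device meets the 3.5 requirement; it does not make the nested launch available on older hardware, which is the point of the question.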