
In one application, I've got a bunch of CUDA kernels. Some use dynamic parallelism and some don't. For the purposes of either providing a fallback option if this is not supported, or simply allowing the application to continue but with reduced/partially available features, how can I go about compiling?

At the moment I'm getting an "invalid device function" error when running kernels that don't require compute capability 3.5, compiled with -arch=sm_35, on a GTX 670 (max sm_30).

AFAIK you can't pass multiple -arch=sm_* arguments, and using multiple -gencode=* options doesn't help. Also, for separable compilation I've had to create an additional object file using -dlink, but this doesn't get created when targeting compute 3.0: because of -lcudadevrt, which I need for 3.5, the link fails with "nvlink fatal : no candidate found in fatbinary". How should I deal with this?

Your question is similar to this one. AFAIK there is no simple, clean solution for this at the moment. It's possible to do if you deconstruct the nvcc build sequence, but I'm not going to go into the details of that. I believe when CUDA 6 is available, it will no longer throw an error when linking cudadevrt against pre-cc3.5 code that otherwise does not attempt to use dynamic parallelism, and then this problem will be straightforward to solve. CUDA 6 should be available soon. - Robert Crovella

2 Answers


I believe this issue has been addressed now in CUDA 6.

Here's my simple test:

$ cat t264.cu
#include <stdio.h>

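__global__ void kernel1(){
  printf("Hello from DP Kernel\n");
}

__global__ void kernel2(){

#if __CUDA_ARCH__ >= 350
  kernel1<<<1,1>>>();   // cc 3.5+: launch a child kernel (dynamic parallelism)
#else
  printf("Hello from non-DP Kernel\n");
#endif
}

int main(){
  cudaSetDevice(0);     // cc 2.0 device: takes the non-DP path
  kernel2<<<1,1>>>();
  cudaDeviceSynchronize();
  cudaSetDevice(1);     // cc 3.5 device: takes the DP path
  kernel2<<<1,1>>>();
  cudaDeviceSynchronize();
  return 0;
}
$ nvcc -O3 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_35,code=sm_35 -rdc=true -o t264 t264.cu -lcudadevrt
$ ./t264
Hello from non-DP Kernel
Hello from DP Kernel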
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Sat_Jan_25_17:33:19_PST_2014
Cuda compilation tools, release 6.0, V6.0.1

In my case, device 0 is a Quadro 5000, a cc 2.0 device, and device 1 is a GeForce GT 640, a cc 3.5 device.
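If the host also needs to know which path a given device will take (for example, to disable a feature up front), a runtime check of the compute capability is one option. This is only a minimal sketch: the helper name `supportsDynamicParallelism` is made up for illustration, and it relies on the documented requirement that dynamic parallelism needs compute capability 3.5 or higher.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative helper (not part of the CUDA API):
// dynamic parallelism requires compute capability 3.5 or higher.
bool supportsDynamicParallelism(int major, int minor) {
  return major > 3 || (major == 3 && minor >= 5);
}

int main() {
  int count = 0;
  cudaError_t err = cudaGetDeviceCount(&count);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
    return 1;
  }
  // Report, per device, which code path a kernel like kernel2 would take.
  for (int dev = 0; dev < count; ++dev) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
    printf("device %d: %s (cc %d.%d) -> %s path\n",
           dev, prop.name, prop.major, prop.minor,
           supportsDynamicParallelism(prop.major, prop.minor) ? "DP" : "non-DP");
  }
  return 0;
}
```

The kernel itself still selects its path at compile time through `__CUDA_ARCH__`; the host check only tells the application what to expect.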


I don't believe there is a way to do this using the runtime API as of CUDA 5.5.

The only way I can think of to get around the problem is to use the driver API to perform your own architecture selection and load code from different cubin files at runtime. The two APIs can be safely mixed, so only the context establishment, device selection, and module loading phase needs to be done with the driver API. You can use the runtime API after that. You will need a little bit of homemade syntactic sugar for the kernel launches, but otherwise no code changes are required in other runtime API code.
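A rough sketch of that selection-and-load phase might look like the following. Several things here are assumptions rather than part of the answer: the cubin file names and the kernel name `my_kernel` are hypothetical, the kernel is assumed to be declared `extern "C"` so its name is not mangled, and the primary-context calls need a toolkit newer than the CUDA 5.5 discussed above. Because cubins are architecture-specific, the selector only covers the sm_30 and sm_35 cases from the question.

```cuda
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

// Abort with a message if a driver API call fails.
#define CHECK(call)                                                      \
  do {                                                                   \
    CUresult r_ = (call);                                                \
    if (r_ != CUDA_SUCCESS) {                                            \
      fprintf(stderr, "%s failed with CUresult %d\n", #call, (int)r_);   \
      exit(EXIT_FAILURE);                                                \
    }                                                                    \
  } while (0)

// Hypothetical file names: one cubin per architecture, built separately.
// Returns NULL when this sketch has no cubin for the device.
const char* selectCubin(int major, int minor) {
  if (major == 3 && minor >= 5) return "kernels_sm35.cubin";  // DP-enabled build
  if (major == 3) return "kernels_sm30.cubin";                // fallback build
  return NULL;
}

int main() {
  CHECK(cuInit(0));
  CUdevice dev;
  CHECK(cuDeviceGet(&dev, 0));

  int major = 0, minor = 0;
  CHECK(cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev));
  CHECK(cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev));

  const char* cubin = selectCubin(major, minor);
  if (cubin == NULL) {
    fprintf(stderr, "no cubin available for cc %d.%d\n", major, minor);
    return EXIT_FAILURE;
  }

  // Use the device's primary context, which the runtime API also uses.
  CUcontext ctx;
  CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
  CHECK(cuCtxSetCurrent(ctx));

  CUmodule mod;
  CHECK(cuModuleLoad(&mod, cubin));
  CUfunction fn;
  CHECK(cuModuleGetFunction(&fn, mod, "my_kernel"));  // hypothetical kernel name

  // Launch 1 block of 1 thread: no shared memory, default stream, no arguments.
  CHECK(cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, NULL, NULL, NULL));
  CHECK(cuCtxSynchronize());

  CHECK(cuModuleUnload(mod));
  CHECK(cuDevicePrimaryCtxRelease(dev));
  return 0;
}
```

Something like `nvcc -o loader loader.cu -lcuda` should build it, since the driver API lives in libcuda. The `cuLaunchKernel` call is where the "homemade syntactic sugar" would go, for instance a small wrapper that packs kernel arguments into the `void**` parameter array.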