0
votes

In one application, I've got a number of CUDA kernels. Some use dynamic parallelism and some don't. I'd like to either provide a fallback path when dynamic parallelism isn't supported, or simply let the application continue with reduced/partially available features. How should I go about compiling this?

At the moment I'm getting an invalid device function error when running kernels compiled with -arch=sm_35 on a GTX 670 (which supports at most sm_30), even for the kernels that don't require compute capability 3.5.

AFAIK you can't pass multiple -arch=sm_* arguments, and using multiple -gencode=* options hasn't helped. Also, for separate compilation I've had to create an additional object file using -dlink, but that object file doesn't get created when targeting compute 3.0 (nvlink fatal : no candidate found in fatbinary, caused by -lcudadevrt, which I've needed for the 3.5 build). How should I deal with this? A rough sketch of the build sequence is below.
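
Roughly, the separate-compilation build looks like this (file names are placeholders and the flags are an approximation, not my exact command lines):

# compile device code with relocatable device code (-dc); names are placeholders
nvcc -gencode arch=compute_35,code=sm_35 -dc kernels.cu -o kernels.o
# device-link step; -lcudadevrt is needed for the dynamic parallelism kernels
nvcc -gencode arch=compute_35,code=sm_35 -dlink kernels.o -o device_link.o -lcudadevrt
# final host link (the CUDA library path also needs to be on this line)
g++ main.o kernels.o device_link.o -lcudart -lcudadevrt -o app

Switching those -gencode options to compute_30/sm_30 while still passing -lcudadevrt is where the nvlink error above appears.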

2
Your question is similar to this one. AFAIK there is no simple clean solution for this at the moment. It's possible to do if you de-construct the nvcc build sequence, but I'm not going to go into the details of that. I believe when CUDA 6 is available, it will no longer throw an error when linking cudadevrt against pre-cc3.5 code that otherwise does not attempt to use dynamic parallelism, and then this problem will be straightforward to solve. CUDA 6 should be available soon. – Robert Crovella

2 Answers

2
votes

I believe this issue has been addressed now in CUDA 6.

Here's my simple test:

$ cat t264.cu
#include <stdio.h>

__global__ void kernel1(){
  printf("Hello from DP Kernel\n");
}

__global__ void kernel2(){

#if __CUDA_ARCH__ >= 350
  kernel1<<<1,1>>>();
#else
  printf("Hello from non-DP Kernel\n");
#endif
}

int main(){

  kernel2<<<1,1>>>();
  cudaDeviceSynchronize();
  return 0;
}
$ nvcc -O3 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_35,code=sm_35 -rdc=true -o t264 t264.cu -lcudadevrt
$ CUDA_VISIBLE_DEVICES="0" ./t264
Hello from non-DP Kernel
$ CUDA_VISIBLE_DEVICES="1" ./t264
Hello from DP Kernel
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Sat_Jan_25_17:33:19_PST_2014
Cuda compilation tools, release 6.0, V6.0.1
$

In my case, device 0 is a Quadro 5000, a cc 2.0 device, and device 1 is a GeForce GT 640, a cc 3.5 device.
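
If the host code also needs to know which path a given device will take (for example, to report that the dynamic-parallelism features are unavailable), one straightforward option is to query the device's compute capability at run time with cudaGetDeviceProperties. This is a minimal sketch, separate from the test above; dynamic parallelism requires compute capability 3.5 or higher:

#include <stdio.h>

int main(){
  int dev = 0;
  cudaDeviceProp prop;
  cudaGetDevice(&dev);
  cudaGetDeviceProperties(&prop, dev);
  // dynamic parallelism requires compute capability 3.5 or higher
  bool dp = (prop.major > 3) || (prop.major == 3 && prop.minor >= 5);
  printf("Device %d (%s): dynamic parallelism %s\n",
         dev, prop.name, dp ? "available" : "not available");
  return 0;
}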

1
votes

I don't believe there is a way to do this using the runtime API as of CUDA 5.5.

The only way I can think of to get around the problem is to use the driver API to perform your own architecture selection and load code from different cubin files at runtime. The two APIs can be safely mixed, so only the context establishment, device selection, and module load phase needs to be done with the driver API. You can use the runtime API after that; you will need a little bit of home-made syntactic sugar for the kernel launches, but otherwise no changes are required in the rest of your runtime API code.
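
A minimal sketch of that selection/load phase with the driver API might look like the following (the cubin file names and the kernel name are placeholders, and error checking is omitted):

#include <cuda.h>
#include <stdio.h>

int main(){
  CUdevice dev;
  CUcontext ctx;
  CUmodule mod;
  CUfunction fn;
  int major = 0, minor = 0;

  cuInit(0);
  cuDeviceGet(&dev, 0);
  cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev);
  cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev);
  cuCtxCreate(&ctx, 0, dev);

  // load the cubin built for the architecture we actually have (file names are placeholders)
  const char *img = (major > 3 || (major == 3 && minor >= 5)) ? "kernels_sm35.cubin"
                                                              : "kernels_sm30.cubin";
  cuModuleLoad(&mod, img);

  // kernels compiled as C++ have mangled names; declaring them extern "C" keeps lookup simple
  cuModuleGetFunction(&fn, mod, "mykernel");
  cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, NULL, NULL, NULL);
  cuCtxSynchronize();
  cuCtxDestroy(ctx);
  return 0;
}

Once the module is loaded, the rest of the application can carry on with the runtime API for memory management and so forth; the launches themselves are the part that needs the home-made wrapper mentioned above.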