2 votes

In my library I need to support devices of compute capability 2.0 and higher. For CC 3.5+ devices I've implemented optimized kernels that use Dynamic Parallelism (DP). It seems that the nvcc compiler does not support DP when anything less than "compute_35,sm_35" is specified (I get compiler/linker errors). My question is: what is the best way to support multiple kernel versions in such a case? Having multiple DLLs and choosing between them at runtime would work, but I was wondering if there is a better way.
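For illustration, a minimal sketch of the multiple-DLL fallback mentioned above: the host app queries the device's compute capability and loads the matching DLL. The DLL names and the exported run_kernels entry point are hypothetical, not from the actual library.

#include <windows.h>
#include <cuda_runtime.h>
#include <cstdio>

// Signature of the (hypothetical) entry point exported by each DLL.
typedef void (*run_kernels_t)(void);

int main()
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  // Pick the DLL built for the device's compute capability.
  const char *dll = (prop.major > 3 || (prop.major == 3 && prop.minor >= 5))
                        ? "mylib_sm35.dll"   // compiled for sm_35, uses DP kernels
                        : "mylib_sm20.dll";  // compiled for sm_20, no DP

  HMODULE h = LoadLibraryA(dll);
  if (!h) { fprintf(stderr, "failed to load %s\n", dll); return 1; }

  run_kernels_t run = (run_kernels_t)GetProcAddress(h, "run_kernels");
  if (run) run();

  FreeLibrary(h);
  return 0;
}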

UPDATE: I'm successfully using #if __CUDA_ARCH__ >= 350 for other features (like __ldg() etc.), but it does not work in the DP case because I have to link against cudadevrt.lib, which produces the following error:

1>nvlink : fatal error : could not find compatible device code in C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v5.5/lib/Win32/cudadevrt.lib
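For reference, this is the kind of per-architecture guard the update refers to for __ldg(); a minimal sketch, with function and parameter names that are illustrative rather than taken from the actual library:

__device__ float load_elem(const float *p)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 350
  return __ldg(p);   // read-only data cache load, available on cc 3.5+
#else
  return *p;         // plain load on older architectures
#endif
}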

You can dynamically set your linker dependencies with a pragma: #pragma comment(lib, ...). - Roger Dahl
@RogerDahl: I have two problems with this solution: 1. __CUDA_ARCH__ is defined only for device code. 2. For some reason, #pragma comment(lib, ...) does not work for that particular library (cudadevrt.lib). That is, if I replace it with, say, cudart.lib then the #pragma works just fine, but for cudadevrt.lib I get errors like 1>nvlink : error : Undefined reference to 'cudaLaunchDevice' in 'Win32/Debug/cdpSimplePrint.cu.obj' - Alexey Kamenev
Take a look at the answers in this question. - Roger Dahl
Have you solved your problem following Roger Dahl's comment? If so, could you post an answer to your own question? This is perfectly legal on StackOverflow and will help improve the CUDA tag by removing this post from the unanswered list. - Vitality
Roger's link is about a different problem. That issue is not related to having both pre-3.5 and 3.5-with-DP code in the same DLL. I'm now waiting for a response from my NVIDIA contact and will post an update here once I get it. - Alexey Kamenev

1 Answer

2 votes

I believe this issue has been addressed now in CUDA 6.

In particular, the problem where specifying the -lcudadevrt library threw a link error for code that does not require dynamic parallelism has been eliminated.

Here's my simple test:

$ cat t264.cu
#include <stdio.h>

__global__ void kernel1(){
  printf("Hello from DP Kernel\n");
}

__global__ void kernel2(){

#if __CUDA_ARCH__ >= 350
  kernel1<<<1,1>>>();
#else
  printf("Hello from non-DP Kernel\n");
#endif
}

int main(){

  kernel2<<<1,1>>>();
  cudaDeviceSynchronize();
  return 0;
}
$ nvcc -O3 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_35,code=sm_35 -rdc=true -o t264 t264.cu -lcudadevrt
$ CUDA_VISIBLE_DEVICES="0" ./t264
Hello from non-DP Kernel
$ CUDA_VISIBLE_DEVICES="1" ./t264
Hello from DP Kernel
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Sat_Jan_25_17:33:19_PST_2014
Cuda compilation tools, release 6.0, V6.0.1
$

In my case, device 0 is a Quadro 5000 (a cc 2.0 device) and device 1 is a GeForce GT 640 (a cc 3.5 device).
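If it is not obvious which device index maps to which GPU on a given machine, a quick device-properties query prints the compute capability of each visible device; a minimal sketch using the standard CUDA runtime API:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  int n = 0;
  cudaGetDeviceCount(&n);
  for (int i = 0; i < n; ++i) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("device %d: %s (cc %d.%d)\n", i, prop.name, prop.major, prop.minor);
  }
  return 0;
}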