1 vote

I can specify the compute capability to the CUDA nvcc compiler, and the default is 2.0: -gencode=arch=compute_20,code=\"sm_20,compute_20\".

I have two computers. One can do compute_20, the other can do compute_30. I am using Visual Studio. Is there a way to tell nvcc to use the maximum compute capability of the local card? Otherwise, I would need a separate project (.vcxproj) on each computer (specifying the max compute capability manually), which isn't ideal.

Yes, it's possible. If you look at one of the CUDA sample projects, you will see how to specify compilation for multiple targets. Basically you can use multiple gencode switches in the same nvcc compile command. – Robert Crovella
I see that in the samples they set the arch to everything: compute_11,sm_11;compute_20,sm_20;compute_30,sm_30;compute_35,sm_35;compute_37,sm_37;compute_50,sm_50; so the code is compiled for all options. Is the driver smart enough to use the highest one? Is there a way to verify that (which code is picked at runtime)? – Zohar Levi
I posted a duplicate thread at: devtalk.nvidia.com/default/topic/883273/… – Zohar Levi

1 Answer

2 votes

Yes, you can specify multiple targets. The CUDA sample codes give examples of how to do this in a Visual Studio project. The basic idea would be to specify multiple -gencode switches (on the nvcc compile command line) via the VS project settings under project...CUDA...device (this can also be specified on a per-source-file basis). In Visual Studio, you just specify the switch parameters, like:

 compute_20,sm_20;compute_30,sm_30;compute_35,sm_35;

and the Visual Studio CUDA-enabled build system will convert that into a sequence of -gencode switches like:

-gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 ...

which the nvcc compiler will recognize and use to generate separate device code for each of the specified targets. This is a fairly complicated subject, so you may want to read about the fatbinary system and nvcc compilation flow in the nvcc manual, or study other questions about it under the cuda tag here on SO, like this one.
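
As an aside not covered in the original answer: if you want to confirm which device code variants actually ended up in the fat binary, the cuobjdump tool shipped with the CUDA toolkit can list the embedded cubin and PTX images, roughly like this (the executable name is just a placeholder):

    cuobjdump --list-elf myapp.exe
    cuobjdump --list-ptx myapp.exe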

Anticipating some of your other questions, which are also covered in the nvcc manual:

  1. The CUDA runtime will select the best fit for the actual device, based on the available targets in your fatbinary. If an exactly matching SASS-compiled binary exists, it will use that; otherwise it will take the closest PTX object and JIT-compile it for the device at hand.

  2. The __CUDA_ARCH__ macro exists and is defined in device code. You could use it to specialize device code for the various targets, which gives you a (somewhat tedious) mechanism to verify that the CUDA runtime did the expected thing when selecting which object to use; a minimal sketch follows below.
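
To make the verification concrete, here is a minimal sketch (my addition, not part of the original answer) that queries the compute capability of device 0 and launches a kernel reporting the __CUDA_ARCH__ value baked into whichever device-code object the runtime selected. If the runtime picked, say, the sm_30 SASS image, the kernel prints 300; a JIT-compiled PTX fallback prints the architecture that PTX was originally generated for:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void report_arch()
    {
    #ifdef __CUDA_ARCH__
        // __CUDA_ARCH__ reflects the target this particular device-code
        // object was compiled for (e.g. 200 for compute_20, 300 for compute_30).
        printf("device code compiled for __CUDA_ARCH__ = %d\n", __CUDA_ARCH__);
    #endif
    }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // what the hardware itself can do
        printf("device 0 compute capability: %d.%d\n", prop.major, prop.minor);

        report_arch<<<1, 1>>>();    // runtime picks the best-fit object from the fatbinary
        cudaDeviceSynchronize();    // flush device-side printf output
        return 0;
    }

(Device-side printf requires compute capability 2.0 or higher, which all of the targets discussed above satisfy.)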