I've recently gotten my head around how NVCC compiles CUDA device code for different compute architectures.
From my understanding, when using NVCC's -gencode option, "arch" is the minimum compute architecture required by the application, and also the minimum device compute architecture for which the PTX code can be JIT-compiled.
I also understand that the "code" parameter of -gencode is the architecture for which NVCC fully compiles the application, so that no JIT compilation is necessary.
After inspecting various CUDA project Makefiles, I've noticed the following occurring regularly:
-gencode arch=compute_20,code=sm_20
-gencode arch=compute_20,code=sm_21
-gencode arch=compute_21,code=sm_21
After some reading, I found that a single binary file can contain code for multiple device architectures, in this case sm_20 and sm_21.
My questions are: why are so many arch/code pairs necessary? Are all of the "arch" values in the above actually used?
What is the difference between that and, say:
-arch compute_20
-code sm_20
-code sm_21
Is the earliest virtual architecture in the "arch" fields selected automatically, or is there some other obscure behaviour?
Is there any other compilation and runtime behaviour I should be aware of?
I've read the manual (http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-compilation), and I'm still not clear about what happens at compilation or at runtime.
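To make the runtime half of the question concrete, here is a minimal probe that I would expect to show which variant actually gets used on a given GPU. It is only a sketch: the kernel name probe is made up for illustration, it assumes device 0 is the GPU of interest, and I am assuming that the ptxVersion and binaryVersion fields filled in by cudaFuncGetAttributes reflect the code the runtime selected.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical probe kernel: stores the virtual architecture it was compiled
// against (__CUDA_ARCH__ is only defined during device compilation).
__global__ void probe(int *arch)
{
#ifdef __CUDA_ARCH__
    *arch = __CUDA_ARCH__;  // e.g. 200 for code generated from compute_20
#endif
}

int main()
{
    // Assumption: device 0 is the GPU of interest.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }
    std::printf("device 0 compute capability: %d.%d\n", prop.major, prop.minor);

    // ptxVersion and binaryVersion are encoded as major * 10 + minor.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, probe);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("ptxVersion=%d binaryVersion=%d\n", attr.ptxVersion, attr.binaryVersion);

    // Run the kernel once and read back the __CUDA_ARCH__ value it saw.
    int *d_arch = NULL;
    int h_arch = 0;
    if (cudaMalloc((void **)&d_arch, sizeof(int)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_arch, 0, sizeof(int));
    probe<<<1, 1>>>(d_arch);
    err = cudaMemcpy(&h_arch, d_arch, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_arch);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel run failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("__CUDA_ARCH__ seen by kernel: %d\n", h_arch);
    return 0;
}
```

Building this once with each of the flag sets above and comparing the output on an sm_20 and an sm_21 card should show whether the choice of "arch" makes any observable difference.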