Determining shared memory usage in CUDA Fortran

Question

I've been writing some basic CUDA Fortran code. I would like to be able to determine the amount of shared memory my program uses per thread block (for occupancy calculation). I have been compiling with -Mcuda=ptxinfo in the hope of finding this information. The compilation output ends with

    ptxas info : Function properties for device_procedures_main_kernel_
        432 bytes stack frame, 1128 bytes spill stores, 604 bytes spill loads
    ptxas info : Used 63 registers, 96 bytes smem, 320 bytes cmem[0]

which is the only place in the output that smem is mentioned. There is one array in the global subroutine main_kernel with the shared attribute. If I remove the shared attribute then I get

    ptxas info : Function properties for device_procedures_main_kernel_
        432 bytes stack frame, 1124 bytes spill stores, 532 bytes spill loads
    ptxas info : Used 63 registers, 320 bytes cmem[0]

The smem figure has disappeared. It seems that only shared memory in main_kernel is being counted. Device subroutines in my code use variables with the shared attribute, but these don't appear to be mentioned in the output. For example, the device subroutine evalfuncs includes shared variable declarations, but the relevant output is

    ptxas info : Function properties for device_procedures_evalfuncs_
        504 bytes stack frame, 1140 bytes spill stores, 508 bytes spill loads

Do all variables with the shared attribute need to be declared in a global subroutine?

Robert Crovella · Accepted Answer · 2014-12-20T19:23:27

Do all variables with the shared attribute need to be declared in a global subroutine?

No.

You haven't shown example code or your compile command, nor have you identified the version of the PGI compiler tools you are using. However, the most likely explanation I can think of for what you are seeing is that, as of PGI 14.x, the default CUDA compile option is to generate relocatable device code. This is documented in section 2.2.3 of the current PGI release notes:

2.2.3. Relocatable Device Code

An rdc option is available for the -ta=tesla and -Mcuda flags that specifies to generate relocatable device code. Starting in PGI 14.1 on Linux and in PGI 14.2 on Windows, the default code generation and linking mode for Tesla-target OpenACC and CUDA Fortran is rdc, relocatable device code. You can disable the default and enable the old behavior and non-relocatable code by specifying any of the following: -ta=tesla:nordc, -Mcuda=nordc, or by specifying any 1.x compute capability or any Radeon target.

So the specific option to enable (or disable) this is:

-Mcuda=(no)rdc

(Note that -Mcuda=rdc is the default if you don't specify this option.)

CUDA Fortran separates Fortran host code from device code. For the device code, the CUDA Fortran compiler does a CUDA Fortran->CUDA C conversion, and passes the auto-generated CUDA C code to the CUDA C compiler. Therefore, the behavior and expectations of switches like rdc and ptxinfo are derived from the behavior of the underlying equivalent CUDA compiler options (-rdc=true and -Xptxas -v, respectively).

When CUDA device code is compiled without the rdc option, the compiler will normally try to inline device (sub)routines that are called from a kernel into the main kernel code. Therefore, when the compiler generates the ptxinfo output, it can determine all resource requirements (e.g. shared memory, registers) at the point where it compiles (PTX assembly) the kernel code.

When the rdc option is specified, however, the compiler may (depending on some other switches and function attributes) leave the device subroutines as separately callable routines with their own entry point (i.e. not inlined). In that scenario, when the device compiler is compiling the kernel code, the call to the device subroutine just looks like a call instruction, and the compiler (at that point) has no visibility into the resource usage requirements of the device subroutine. This does not mean that there is an underlying flaw in the compile sequence. It simply means that the ptxinfo mechanism cannot accurately roll up the resource requirements of the kernel and all of its called subroutines at that point in time.

The ptxinfo output also does not declare the total amount of shared memory used by a device subroutine, when it is compiling that subroutine, in rdc mode.
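To see which functions report a shared memory figure at all, the ptxinfo text can be scanned per function. The following is a minimal Python sketch, assuming the output has the same layout as the lines quoted in the question; the names PTXINFO and smem_by_function are made up for illustration, and the sample text is simply the question's two rdc-mode fragments joined together.

```python
import re

# Sample ptxinfo text copied from the question (rdc mode, the default).
PTXINFO = """\
ptxas info : Function properties for device_procedures_main_kernel_
    432 bytes stack frame, 1128 bytes spill stores, 604 bytes spill loads
ptxas info : Used 63 registers, 96 bytes smem, 320 bytes cmem[0]
ptxas info : Function properties for device_procedures_evalfuncs_
    504 bytes stack frame, 1140 bytes spill stores, 508 bytes spill loads
"""


def smem_by_function(text):
    """Map each function name to its reported smem bytes.

    The value is None when no smem figure appears for that function.
    Assumes the plain "<n> bytes smem" form shown in the question.
    """
    result = {}
    current = None
    for line in text.splitlines():
        match = re.search(r"Function properties for (\S+)", line)
        if match:
            # Start of a new function's section.
            current = match.group(1)
            result[current] = None
            continue
        match = re.search(r"(\d+) bytes smem", line)
        if match and current is not None:
            result[current] = int(match.group(1))
    return result


report = smem_by_function(PTXINFO)
for name, smem in report.items():
    print(name, "->", "not reported" if smem is None else f"{smem} bytes smem")
```

In this sample only the kernel carries an smem figure, and that figure covers just the array declared in the kernel itself, which is consistent with the explanation above.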

If you turn off the rdc mode (sub-options can be combined with commas, e.g. -Mcuda=ptxinfo,nordc):

-Mcuda=nordc

I believe you will see an accurate accounting of the shared memory used by a kernel plus all of its called subroutines, given a few caveats. One is that the compiler is able to successfully inline your called subroutines (pretty likely, and the accounting should still work even if it can't). Another is that the kernel and all of its called subroutines are in the same file (i.e. translation unit). If you have kernels that call device subroutines in different translation units, then the rdc option is the only way to make it work.

Shared memory will still be appropriately allocated for your code at runtime, regardless (assuming you have not exceeded the total amount of shared memory available). You can also get an accurate reading of the shared memory used by a kernel by profiling your code, using a profiler such as nvvp or nvprof.
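Once a trustworthy per-block figure is available, whether from ptxinfo or from a profiler, the shared memory part of the occupancy calculation mentioned in the question is a simple division. The sketch below is illustrative only: the function name is hypothetical, the 49152-byte (48 KB) default for shared memory per multiprocessor is an assumption that depends on the GPU and its configuration, and allocation granularity as well as the register and thread limits are ignored.

```python
def blocks_limited_by_smem(smem_per_block, smem_per_sm=49152):
    """Upper bound on resident blocks per multiprocessor due to shared memory.

    smem_per_sm defaults to 48 KB, an assumed value; check your device.
    Returns None when the kernel uses no shared memory (no limit from smem).
    """
    if smem_per_block == 0:
        return None
    return smem_per_sm // smem_per_block


# Using the 96 bytes reported for main_kernel in the question:
print(blocks_limited_by_smem(96))  # 49152 // 96 = 512
```

The actual number of resident blocks is the minimum over all such limits, so this bound only matters when shared memory is the scarcest resource.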

If this explanation doesn't describe what you are seeing, I would suggest providing a complete sample code, as well as the exact compile command you are using, plus the version of PGI tools you are using. (I think it's a good suggestion for future questions as well.)
