
At roughly what level of code complexity do OpenACC kernels start to lose efficiency on common GPUs, and at what point do registers, shared memory, or some other resource start to bottleneck performance?

Also, is there a point where there is too little work per kernel and the overhead of transferring it to the GPU and its cores becomes the bottleneck?

Would cache sizes, and whether the code fits in them, indicate the optimal amount of work per kernel, or is something else the better guide?

Roughly how large is the OpenACC overhead per kernel compared to the potential performance, and does it vary much between directives?


1 Answer


I would refrain from using the complexity of the code as an indication of performance. You can have a highly complex code run very efficiently on a GPU and a simple code run poorly. Instead, I would look at the following factors:

  1. Data movement between the device and host. Limit the frequency of data movement and try to transfer data in contiguous chunks. Use OpenACC unstructured data regions to match the host allocation on the device (i.e. use "enter data" at the same point where you allocate the data via "new" or "malloc"). Move as much compute to the GPU as you can and use the OpenACC "update" directive to synchronize host and device data only when absolutely necessary. In cases where data movement is unavoidable, investigate using the "async" clause to overlap the data movement with compute (see the data-movement sketch after this list).
  2. Data access on the device and limiting memory divergence. Lay out your data so that the stride-1 (contiguous) dimension of your arrays is accessed contiguously across the vector lanes (see the stride-1 access sketch below).
  3. Have a high compute intensity, which is the ratio of computation to data movement. The more compute and the less data movement, the better. However, lower-intensity loops are fine if there are other high-intensity loops and the cost of moving the data back to the host would outweigh the cost of running the kernel on the device.
  4. Avoid allocating data on the device, since it forces the threads to serialize. This includes Fortran "automatic" arrays and C++ objects whose constructors allocate (see the scratch-array sketch below).
  5. Avoid atomic operations. Device atomics are actually quite efficient compared to host atomics, but they should still be avoided where possible (see the atomic sketch below).
  6. Avoid subroutine calls. Try to inline routines when possible (see the "acc routine" sketch below).
  7. Occupancy. Occupancy is the ratio of the number of threads that can actually be resident on the GPU to the maximum number of threads the hardware can keep resident. Note that 100% occupancy does not guarantee high performance, but you should try to get above 50% if possible. The limiters to occupancy are the number of registers used per thread (vector) and the shared memory used per block (gang). Assuming you're using the PGI compiler, you can see your device's limits by running the PGI "pgaccelinfo" utility. The number of registers used depends on the number of local scalars (both those explicitly declared by the programmer and the temporaries the compiler creates to hold intermediate results), and the amount of shared memory used is determined by the OpenACC "cache" directive and by "private" clauses on "gang" loops. You can see how much each kernel uses by adding the flag "-ta=tesla:ptxinfo". You can limit the number of registers used per thread via "-ta=tesla:maxregcount:<n>". Reducing the number of registers will increase occupancy but also increase the number of register spills. Spills are fine so long as they only spill to the L1/L2 cache; spilling to global memory will hurt performance. It is often better to accept lower occupancy than to spill to global memory.
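
To make point 1 concrete, here is a minimal C sketch (the array names and sizes are invented for the example) of an unstructured data region created alongside the host allocation, with the compute kept on the device and a single "update self" when the host actually needs the result:

```c
#include <stdlib.h>

int main(void)
{
    int n = 1 << 20;
    double *a = (double *)malloc(n * sizeof(double));
    double *b = (double *)malloc(n * sizeof(double));

    /* Create the device copies at the same point as the host allocation. */
    #pragma acc enter data create(a[0:n], b[0:n])

    /* Initialize and compute entirely on the device; no host round trips. */
    #pragma acc parallel loop present(a[0:n], b[0:n])
    for (int i = 0; i < n; ++i) {
        a[i] = (double)i;
        b[i] = 2.0 * a[i];
    }

    /* Synchronize back to the host only when the data is actually needed
       (adding "async" here would let the transfer overlap other compute). */
    #pragma acc update self(b[0:n])

    /* Delete the device copies when the host memory is freed. */
    #pragma acc exit data delete(a[0:n], b[0:n])
    free(a);
    free(b);
    return 0;
}
```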
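
For point 2, a hedged sketch of the access pattern in C, where the last array index is the stride-1 dimension (the array names, N, and M are made up for illustration):

```c
#define N 1024
#define M 1024

/* Good: the vector loop runs over j, the stride-1 index in C, so
   consecutive vector lanes touch consecutive memory locations. */
void add_contiguous(double a[N][M], double b[N][M], double c[N][M])
{
    #pragma acc parallel loop gang copyin(a[0:N][0:M], b[0:N][0:M]) copyout(c[0:N][0:M])
    for (int i = 0; i < N; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < M; ++j)
            c[i][j] = a[i][j] + b[i][j];
    }
}

/* Poor: vectorizing over i makes each lane stride by M doubles, so the
   memory accesses diverge and bandwidth is wasted. */
void add_strided(double a[N][M], double b[N][M], double c[N][M])
{
    #pragma acc parallel loop gang copyin(a[0:N][0:M], b[0:N][0:M]) copyout(c[0:N][0:M])
    for (int j = 0; j < M; ++j) {
        #pragma acc loop vector
        for (int i = 0; i < N; ++i)
            c[i][j] = a[i][j] + b[i][j];
    }
}
```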
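
For point 4, a sketch contrasting a device-side allocation inside the kernel with a "private" scratch array; the function names and sizes are hypothetical:

```c
#include <stdlib.h>

#define M 128

/* Poor: malloc inside the compute region is a device-side allocation and
   forces the threads to serialize (the same problem as a Fortran automatic
   array or a C++ constructor that allocates). */
void scale_rows_bad(int n, double out[][M], const double in[][M])
{
    #pragma acc parallel loop copyin(in[0:n][0:M]) copyout(out[0:n][0:M])
    for (int i = 0; i < n; ++i) {
        double *tmp = (double *)malloc(M * sizeof(double));
        for (int k = 0; k < M; ++k)
            tmp[k] = 2.0 * in[i][k];
        for (int k = 0; k < M; ++k)
            out[i][k] = tmp[k];
        free(tmp);
    }
}

/* Better: a fixed-size scratch array marked "private" gives every iteration
   its own copy without any allocation inside the kernel. */
void scale_rows_good(int n, double out[][M], const double in[][M])
{
    double tmp[M];
    #pragma acc parallel loop private(tmp) copyin(in[0:n][0:M]) copyout(out[0:n][0:M])
    for (int i = 0; i < n; ++i) {
        for (int k = 0; k < M; ++k)
            tmp[k] = 2.0 * in[i][k];
        for (int k = 0; k < M; ++k)
            out[i][k] = tmp[k];
    }
}
```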
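
For point 5, a small sketch of an atomic update (the histogram and its names are only illustrative):

```c
/* Histogram: the atomic directive makes the increment safe when several
   vector lanes hit the same bin, but colliding updates still serialize,
   so restrict atomics to where they are truly needed. */
void histogram(int n, const int *restrict key, int nbins, int *restrict bins)
{
    #pragma acc parallel loop copyin(key[0:n]) copy(bins[0:nbins])
    for (int i = 0; i < n; ++i) {
        #pragma acc atomic update
        bins[key[i]] += 1;
    }
}
```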
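
For point 6, a sketch of a call that is not inlined; the function names are hypothetical, and the mention of "-Minline" refers to the PGI inlining flag of that name:

```c
/* When a call cannot be inlined, the callee needs an "acc routine"
   directive so the compiler builds a device version of it; inlining
   (e.g. with PGI's -Minline) usually performs better than a real call. */
#pragma acc routine seq
static double squared(double x)
{
    return x * x;
}

double sum_of_squares(int n, const double *restrict x)
{
    double s = 0.0;
    #pragma acc parallel loop reduction(+:s) copyin(x[0:n])
    for (int i = 0; i < n; ++i)
        s += squared(x[i]);
    return s;
}
```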

Note that I highly recommend using a profiler (PGPROF, NVprof, Score-P, TAU, Vampir, etc.) to help discover a program's performance bottlenecks.