At about what code complexity do OpenACC kernels lose efficiency on common GPU and register, shared memory operations or some other aspect starts to bottleneck performance?
Also is there some point where too few tasks and overhead of transferring to GPU and cores would become a bottleneck?
Would cache sizes and if code fits indicate optimal task per kernel or something else?
About how big is the OpenACC overhead per kernel compared to potential performance and does it vary a lot with various directives?