I am trying to use !$acc cache for a specific loop in a 2D Laplace solver. When I analyse the code with -Mcuda=ptxinfo, it reports no shared memory (smem) usage, yet the code runs slower than the version without the directive.
Here is the relevant part of the code:
!$acc parallel loop reduction(max:error) num_gangs(n/THREADS) vector_length(THREADS)
do j=2,m-1
  do i=2,n-1
#ifdef SHARED
    !$acc cache(A(i-1:i+1,j),A(i,j-1:j+1))
#endif
    Anew(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + A(i,j-1) + A(i,j+1) )
    error = max( error, abs( Anew(i,j) - A(i,j) ) )
  end do
end do
!$acc end parallel loop
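In case it helps anyone reproduce this, here is a minimal self-contained sketch of the same sweep. It is not my full solver. The 4096 x 4096 size and the vector length of 16 are taken from the -Minfo=accel output further down. The program name, the real(8) type, the boundary values, and writing THREADS as a parameter instead of a macro are assumptions made only for this sketch. The cache directive is left unconditional here so the file does not need the preprocessor.

```fortran
! Sketch of one Jacobi sweep of the 2D Laplace solver with !$acc cache.
! Assumed for illustration only: program name, real(8) data, the boundary
! values, and "threads" as a parameter (the original uses a THREADS macro).
program lap2d_sketch
  implicit none
  integer, parameter :: n = 4096, m = 4096   ! matches a(:4096,:4096) in -Minfo
  integer, parameter :: threads = 16         ! matches vector(16) in -Minfo
  real(8), allocatable :: A(:,:), Anew(:,:)
  real(8) :: error
  integer :: i, j

  allocate(A(n,m), Anew(n,m))
  A = 0.0d0
  Anew = 0.0d0
  A(1,:) = 1.0d0          ! assumed boundary condition: one edge held at 1

  error = 0.0d0

  ! Same data clauses that -Minfo reports: copy(a), create(anew)
  !$acc data copy(A) create(Anew)

  !$acc parallel loop reduction(max:error) num_gangs(n/threads) vector_length(threads)
  do j = 2, m-1
    do i = 2, n-1
      ! Ask the compiler to keep the 5-point stencil neighbourhood in fast memory
      !$acc cache(A(i-1:i+1,j), A(i,j-1:j+1))
      Anew(i,j) = 0.25d0 * ( A(i+1,j) + A(i-1,j) + A(i,j-1) + A(i,j+1) )
      error = max( error, abs( Anew(i,j) - A(i,j) ) )
    end do
  end do
  !$acc end parallel loop

  !$acc end data

  ! With the assumed boundary values the first sweep gives a max error of 0.25
  print '(a,f8.5)', 'max error after one sweep: ', error

  deallocate(A, Anew)
end program lap2d_sketch
```

Compiled without OpenACC enabled, the directives are plain comments and the program runs serially, which gives a reference value to compare against the accelerated build with and without the cache line.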
This is the -Mcuda=ptxinfo output with !$acc cache:
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_20'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 28 registers, 96 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_20'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 12 registers, 96 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_20'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 20 registers, 64 bytes cmem[0]
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_30'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 37 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_30'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_30'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 20 registers, 352 bytes cmem[0]
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_35'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 38 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_35'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_35'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 39 registers, 352 bytes cmem[0]
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_50'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 37 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_50'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 12 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_50'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 30 registers, 352 bytes cmem[0]
This is the output without !$acc cache:
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_20'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 23 registers, 88 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_20'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 12 registers, 88 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_20'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 20 registers, 64 bytes cmem[0]
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_30'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 29 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_30'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_30'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 20 registers, 352 bytes cmem[0]
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_35'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 36 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_35'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_35'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 39 registers, 352 bytes cmem[0]
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'acc_lap2d_39_gpu' for 'sm_50'
ptxas info : Function properties for acc_lap2d_39_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 38 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_39_gpu_red' for 'sm_50'
ptxas info : Function properties for acc_lap2d_39_gpu_red
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 12 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function 'acc_lap2d_58_gpu' for 'sm_50'
ptxas info : Function properties for acc_lap2d_58_gpu
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 30 registers, 352 bytes cmem[0]
The -Minfo=accel output also reports that a block of A has been cached:
acc_lap2d:
17, Generating copy(a(:4096,:4096))
Generating create(anew(:4096,:4096))
39, Accelerator kernel generated
Generating Tesla code
39, Max reduction generated for error
40, !$acc loop gang(256) ! blockidx%x
41, !$acc loop vector(16) ! threadidx%x
Cached references to size [(x)x3] block of a
Loop is parallelizable
58, Accelerator kernel generated
Generating Tesla code
59, !$acc loop gang ! blockidx%x
60, !$acc loop vector(128) ! threadidx%x
Loop is parallelizable
How can I use the cache directive (shared memory in the CUDA sense) efficiently in OpenACC?
Thank you so much for your help.
Behzad