3
votes

I have recently been studying CUDA, and I would like to understand CUDA memory access times.

The CUDA Programming Guide gives these memory access times:

  • Global memory access time: 400 ~ 600 cycles
  • Shared memory (L1 cache) access time: 20 ~ 40 cycles

I assume that a cycle is the same as a clock tick. Is this correct? To check, I measured memory access times myself. The host code is fixed, but the kernel comes in 3 versions. This is my code:


Host code

float* H1 = (float*)malloc( sizeof(float)*100000 );   // host input buffer
float* D1;                                            // device input buffer

for( int i = 0 ; i < 100000 ; i++ ){
    H1[i] = i;
}

cudaMalloc( (void**)&D1, sizeof(float)*100000 );
cudaMemcpy( D1, H1, sizeof(float)*100000, cudaMemcpyHostToDevice );

cudaPrintfInit();
test<<<1,1>>>( D1 );              // single block, single thread
cudaPrintfDisplay( stdout, true );
cudaPrintfEnd();

Kernel version 1

float Global1;
float Global2;
float Global3;

int Clock;

Clock = clock();
Global1 = Dev_In1[1];
Clock = clock() - Clock;
cuPrintf("Global Memory Access #1 : %d\n", Clock );

Clock = clock();
Global2 = Dev_In1[2];
Clock = clock() - Clock;
cuPrintf("Global Memory Access #2 : %d\n", Clock );

Clock = clock();
Global3 = Dev_In1[3];
Clock = clock() - Clock;
cuPrintf("Global Memory Access #3 : %d\n", Clock );

Its result:

Global Memory Access #1 : 882
Global Memory Access #2 : 312
Global Memory Access #3 : 312

I think the first access was not cached and so took ~880 cycles, while the 2nd and 3rd accesses took only 312 cycles because Dev_In1[2] and Dev_In1[3] had already been cached: floats 1, 2, and 3 all fall in the same 128-byte cache line, so the first load brings the other two in with it.


Kernel version 2

int Global1, Global2, Global3;
int Clock;

Clock = clock();
Global1 = Dev_In1[1];
Clock = clock() - Clock;
cuPrintf("Global Memory Access #1 : %d\n", Clock );

Clock = clock();
Global2 = Dev_In1[50000];
Clock = clock() - Clock;
cuPrintf("Global Memory Access #2 : %d\n", Clock );

Clock = clock();
Global3 = Dev_In1[99999];
Clock = clock() - Clock;
cuPrintf("Global Memory Access #3 : %d\n", Clock );

Its result:

Global Memory Access #1 : 872
Global Memory Access #2 : 776
Global Memory Access #3 : 782

I think Dev_In1[50000] and Dev_In1[99999] were not cached at first access (the three indices are far apart, so they sit in different cache lines), which is why accesses #1, #2, and #3 are all slow.


Kernel version 3

int Global1, Global2, Global3;
int Clock;

Clock = clock();
Global1 = Dev_In1[1];
Clock = clock() - Clock;
cuPrintf("Global Memory Access #1 : %d\n", Clock );

Clock = clock();
Global1 = Dev_In1[50000];
Clock = clock() - Clock;
cuPrintf("Global Memory Access #2 : %d\n", Clock );

Clock = clock();
Global1 = Dev_In1[99999];
Clock = clock() - Clock;
cuPrintf("Global Memory Access #3 : %d\n", Clock );

Its result:

Global Memory Access #1 : 168
Global Memory Access #2 : 168
Global Memory Access #3 : 168

I don't understand this result.

Dev_In1[50000] and Dev_In1[99999] are not cached, yet the access time is very fast! The only change from version 2 is that I used a single variable.

So, my question is: is a GPU cycle the same as a GPU clock tick?

And, comparing results 1, 2, and 3, why is the memory access time so fast in result 3?

1
Your compiler might be optimizing the code. Namely, because you never use Global1 anywhere, the compiler might not read it at all. Try to use Global1 before reading the next value from memory. – phoad
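A minimal sketch of that fix for the first timed section, assuming a hypothetical Dev_Out output buffer (not in the original code) so the loaded value stays live: consuming the value before stopping the clock gives the load a dependent instruction, so the compiler cannot drop the read and the timed section has to wait for it.

__global__ void test( float* Dev_In1, float* Dev_Out )
{
    float Sink = 0.0f;
    int Clock;

    Clock = clock();
    float Global1 = Dev_In1[1];
    Sink += Global1;                 // dependent use: stalls until the load returns
    Clock = clock() - Clock;
    cuPrintf("Global Memory Access #1 : %d\n", Clock );

    Dev_Out[0] = Sink;               // write Sink out so none of this is dead code
}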
The value returned by clock() increments at the rate given by the device attribute CU_DEVICE_ATTRIBUTE_CLOCK_RATE. The clock() value is in cycles, and it is not a fixed frequency. During the execution of a kernel the frequency is fixed on 2.x and older devices. On 3.x devices the clock rate will vary with power management and thermal management. – Greg Smith
Are you running these three kernels in the same CUDA context during the same execution of the process, or are you launching the process 3 times? If the former, running kernel 2 and then kernel 3 against the same device memory will likely result in the values being cached in L2, which would account for the significantly lower elapsed time. When using clock() you should verify that the SASS (assembly code, not PTX) has the correct instruction sequence; it is very common for the compiler to move or eliminate calls to clock(). I would recommend inserting a __threadfence() after cuPrintf to flush the LSU unit. – Greg Smith
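A sketch of that suggestion applied to one timed section; whether the fence is actually needed on your device is an assumption to verify by inspecting the SASS, e.g. with cuobjdump -sass on the compiled binary.

Clock = clock();
Global1 = Dev_In1[1];
Clock = clock() - Clock;
cuPrintf("Global Memory Access #1 : %d\n", Clock );
__threadfence();   // drain outstanding memory operations before the next timed section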

1 Answer

1
votes

For the reason stated by @phoad, your measurements are not valid. After the memory access and before stopping the clock, you should reuse the value read from memory to create an instruction dependency on the outstanding load. Otherwise, the GPU issues the independent instructions one after another, and the clock-stop executes immediately after the clock-start and the load. I suggest you try the microbenchmarking suite prepared by Henry Wong, available here. Using this suite you can retrieve various microarchitectural details, including memory access latency. If you only need memory latency, it is easier to try CUDA latency, which was developed by Sylvain Collange.
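For reference, suites like these typically measure latency with a pointer chase: each load's address depends on the previous load's result, so the loads serialize and the average time per iteration approximates the latency of a single access. A minimal sketch of the idea (hypothetical names, not code from either suite):

// The host pre-fills arr so that arr[i] = (i + STRIDE) % N,
// forming one long cycle through the buffer.
__global__ void latency( unsigned int* arr, int iters,
                         unsigned int* out, int* cycles )
{
    unsigned int j = 0;
    clock_t start = clock();
    for( int i = 0 ; i < iters ; i++ )
        j = arr[j];                          // each load depends on the previous one
    clock_t stop = clock();
    *out    = j;                             // keep j live so the loop is not removed
    *cycles = (int)( stop - start ) / iters; // approximate cycles per access
}

With a stride larger than a cache line and a buffer larger than the caches, the per-access figure approaches the raw global-memory latency; with small strides it shows the cached latency instead.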