I have recently been studying CUDA, and I want to understand CUDA memory access times.
The CUDA Programming Guide gives the following memory access times:
- Global memory access time: 400 to 600 cycles
- Shared memory (L1 cache) access time: 20 to 40 cycles
I think that a cycle is the same as one clock tick. Is this correct? Assuming it is, I measured the memory access times myself. The host code is fixed, but the kernel code has three versions. This is my code:
Host code:
float* H1 = (float*)malloc(sizeof(float)*100000);
float* D1;
for( int i = 0 ; i < 100000 ; i++ ){
    H1[i] = i;
}
cudaMalloc( (void**)&D1, sizeof(float)*100000);
cudaMemcpy( D1, H1, sizeof(float)*100000, cudaMemcpyHostToDevice );
cudaPrintfInit();
test<<<1,1>>>( D1 );
cudaPrintfDisplay(stdout, true);
cudaPrintfEnd();
Kernel version 1:
float Global1;
float Global2;
float Global3;
int Clock;
Clock = clock();
Global1 = Dev_In1[1];
Clock = clock() - Clock;
cuPrintf("Global Memory Access #1 : %d\n", Clock );
Clock = clock();
Global2 = Dev_In1[2];
Clock = clock() - Clock;
cuPrintf("Global Memory Access #2 : %d\n", Clock );
Clock = clock();
Global3 = Dev_In1[3];
Clock = clock() - Clock;
cuPrintf("Global Memory Access #3 : %d\n", Clock );
Its result:
Global Memory Access #1 : 882
Global Memory Access #2 : 312
Global Memory Access #3 : 312
I think the first access took more than 800 cycles because nothing was cached yet, while the 2nd and 3rd accesses took only 312 cycles because Dev_In1[2] and Dev_In1[3] were already cached.
Kernel version 2:
int Global1, Global2, Global3;
int Clock;
Clock = clock();
Global1 = Dev_In1[1];
Clock = clock() - Clock;
cuPrintf("Global Memory Access #1 : %d\n", Clock );
Clock = clock();
Global2 = Dev_In1[50000];
Clock = clock() - Clock;
cuPrintf("Global Memory Access #2 : %d\n", Clock );
Clock = clock();
Global3 = Dev_In1[99999];
Clock = clock() - Clock;
cuPrintf("Global Memory Access #3 : %d\n", Clock );
Its result:
Global Memory Access #1 : 872
Global Memory Access #2 : 776
Global Memory Access #3 : 782
I think Dev_In1[50000] and Dev_In1[99999] were not brought into the cache by the first access, so #1, #2, and #3 are all slow.
Kernel version 3:
int Global1, Global2, Global3;
int Clock;
Clock = clock();
Global1 = Dev_In1[1];
Clock = clock() - Clock;
cuPrintf("Global Memory Access #1 : %d\n", Clock );
Clock = clock();
Global1 = Dev_In1[50000];
Clock = clock() - Clock;
cuPrintf("Global Memory Access #2 : %d\n", Clock );
Clock = clock();
Global1 = Dev_In1[99999];
Clock = clock() - Clock;
cuPrintf("Global Memory Access #3 : %d\n", Clock );
Its result:
Global Memory Access #1 : 168
Global Memory Access #2 : 168
Global Memory Access #3 : 168
I don't understand this result.
Dev_In1[50000] and Dev_In1[99999] should not be cached, but the access time is very fast! The only change from version 2 is that I used a single variable for all three loads.
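One check I could run on version 3 is sketched below. It performs the same three loads, but stores each loaded value to an output buffer so that the loads feed a visible result. The names Out, Cycles, DOut, DCycles, and Index are hypothetical additions of mine. It uses clock64() and a host-side printf instead of cuPrintf, which assumes compute capability 2.0 or higher. The compiler may still reorder instructions around clock64(), and the timed region includes issuing the store, so the numbers are only a rough indication.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Sketch of kernel version 3 with one change: every loaded value is written
// to Out (a hypothetical extra buffer), so the loads feed a visible result.
__global__ void test(const float* Dev_In1, float* Out, long long* Cycles)
{
    const int Index[3] = {1, 50000, 99999};   // same indices as version 3
    for (int i = 0; i < 3; ++i) {
        long long Start = clock64();
        Out[i] = Dev_In1[Index[i]];           // load, then store the value
        Cycles[i] = clock64() - Start;        // includes issuing the store
    }
}

int main()
{
    const int N = 100000;
    float* H1 = (float*)malloc(sizeof(float) * N);
    for (int i = 0; i < N; i++) H1[i] = (float)i;

    float* D1;
    float* DOut;
    long long* DCycles;
    cudaMalloc((void**)&D1, sizeof(float) * N);
    cudaMalloc((void**)&DOut, sizeof(float) * 3);
    cudaMalloc((void**)&DCycles, sizeof(long long) * 3);
    cudaMemcpy(D1, H1, sizeof(float) * N, cudaMemcpyHostToDevice);

    test<<<1, 1>>>(D1, DOut, DCycles);

    float HOut[3];
    long long HCycles[3];
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t Err = cudaMemcpy(HCycles, DCycles, sizeof(HCycles), cudaMemcpyDeviceToHost);
    if (Err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(Err));
        return 1;
    }
    cudaMemcpy(HOut, DOut, sizeof(HOut), cudaMemcpyDeviceToHost);

    for (int i = 0; i < 3; i++)
        printf("Global Memory Access #%d : %lld (value %.0f)\n", i + 1, HCycles[i], HOut[i]);

    cudaFree(D1);
    cudaFree(DOut);
    cudaFree(DCycles);
    free(H1);
    return 0;
}
```

If the counts from this variant look like those of version 2 rather than 168, that would suggest the loads in version 3 are not being executed as written.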
So my questions are: is a GPU cycle the same as a GPU clock?
And comparing results 1, 2, and 3, why is the memory access time so much faster in result 3?
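For the first question, here is a minimal sketch of how I would convert the measured counts to time, assuming that one cycle equals one tick of the clock that clock() counts (this is exactly the assumption I am asking about). CyclesToNanoseconds is a hypothetical helper of mine. The clock rate returned by cudaDeviceGetAttribute is the peak frequency in kHz, so the actual running clock may be lower.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Convert a cycle count to nanoseconds for a clock rate given in kHz.
// Assumption: one "cycle" equals one tick of the clock that clock() counts.
double CyclesToNanoseconds(long long Cycles, int ClockRateKHz)
{
    return (double)Cycles * 1.0e6 / (double)ClockRateKHz;
}

int main()
{
    int ClockRateKHz = 0;
    // cudaDevAttrClockRate reports the peak clock frequency of device 0 in kHz.
    cudaError_t Err = cudaDeviceGetAttribute(&ClockRateKHz, cudaDevAttrClockRate, 0);
    if (Err != cudaSuccess || ClockRateKHz <= 0) {
        printf("Could not query the clock rate: %s\n", cudaGetErrorString(Err));
        return 1;
    }

    const long long Measured[3] = {882, 312, 168};  // counts taken from my results
    for (int i = 0; i < 3; i++)
        printf("%lld cycles ~ %.1f ns at %d kHz\n",
               Measured[i], CyclesToNanoseconds(Measured[i], ClockRateKHz), ClockRateKHz);
    return 0;
}
```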