Here is some simple CUDA code.
I am measuring the time it takes to access global memory, for both reads and writes.
Below is the kernel function, test1():
__global__ void test1(int *direct_map)
{
    int index = 10;
    int index2;
    for (int j = 0; j < 1024; j++)
    {
        index2 = direct_map[index];
        direct_map[index] = -1;
        index = index2;
    }
}
direct_map is a 683*1024 matrix stored linearly, and each pixel holds an offset value used to access another pixel.
The addresses given by index and index2 are not contiguous.
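For anyone who wants to reproduce this, here is a rough host-side sketch of one way such a map could be filled. It is not the real data: it assumes the map is a random permutation forming a single cycle (Sattolo's algorithm), so that a 1024-step chase starting at index 10 never revisits an element that test1() has already overwritten with -1. The name `build_direct_map` and its parameters are made up for illustration.

```cuda
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper (not from the original post): builds a width*height
// map in which following map[i] repeatedly visits every element exactly once
// before returning to i. Uses Sattolo's algorithm, which produces a random
// permutation consisting of a single cycle.
std::vector<int> build_direct_map(int width, int height, unsigned seed)
{
    const int n = width * height;
    std::vector<int> map(n);
    std::iota(map.begin(), map.end(), 0);  // start from the identity mapping

    std::mt19937 rng(seed);
    for (int i = n - 1; i > 0; --i) {
        // Pick a partner strictly below i; excluding i itself is what
        // guarantees one single cycle instead of several short ones.
        std::uniform_int_distribution<int> pick(0, i - 1);
        std::swap(map[i], map[pick(rng)]);
    }
    return map;
}
```

Calling `build_direct_map(1024, 683, 1u)` gives 683*1024 entries that can be copied to the device with `cudaMemcpy` before launching the kernel.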
This kernel function takes about 600 microseconds.
But if I delete the line
direct_map[index] = -1;
it takes just 27 microseconds.
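To check whether the 27-microsecond figure really measures 1024 loads, one option is to time a read-only variant that stores its result once at the end, since a compiler may drop loads whose values are never used. The sketch below assumes a single-thread launch (`<<<1, 1>>>`) and timing with CUDA events, neither of which is stated in this post. The names `test1_read_only`, `out`, and `time_read_only_us` are made up for illustration, and `host_map` must hold valid indices along the whole chain starting at 10.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Read-only variant of test1: the same dependent chain of loads, but no store
// inside the loop. The `out` parameter is hypothetical; storing the final
// index to it gives the loads an observable result.
__global__ void test1_read_only(const int *direct_map, int *out)
{
    int index = 10;
    for (int j = 0; j < 1024; j++)
        index = direct_map[index];
    *out = index;
}

// Hypothetical helper: copies host_map to the device and returns the time of
// one <<<1, 1>>> launch in microseconds. Error checking is omitted for brevity.
float time_read_only_us(const std::vector<int> &host_map)
{
    int *d_map = nullptr, *d_out = nullptr;
    const size_t bytes = host_map.size() * sizeof(int);
    cudaMalloc(&d_map, bytes);
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_map, host_map.data(), bytes, cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time setup cost is not measured. Note that it can
    // also leave part of the map in the L2 cache for the timed launch.
    test1_read_only<<<1, 1>>>(d_map, d_out);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    test1_read_only<<<1, 1>>>(d_map, d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the timed launch has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_map);
    cudaFree(d_out);
    return ms * 1000.0f;  // milliseconds to microseconds
}
```

If this variant takes a time much closer to the 600-microsecond figure than to 27 microseconds, that would suggest the loop was optimized away once the store was deleted, rather than the reads being that fast.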
I think the code has already read the value of direct_map[index] from global memory in
index2 = direct_map[index];
so it should already be in the L2 cache.
So when executing "direct_map[index] = -1;", the write should be fast.
I also tested random writes to global memory (test2()).
That kernel takes about 120 microseconds.
__global__ void test2(int *direct_map)
{
    int index = 10;
    for (int j = 0; j < 1024; j++)
    {
        direct_map[index] = -1;
        // Note: for j = 0 this evaluates to -1, which is used on the next iteration.
        index = j*683 + j/3 - 1;
    }
}
So I don't understand why test1() takes more than 600 microseconds. Thank you.