
Here is some simple CUDA code. I am testing the time to access global memory, both reads and writes.

Below is the kernel function (test1()).


__global__ void test1(int *direct_map)
{
    int index = 10;
    int index2;

    for (int j = 0; j < 1024; j++)
    {
        index2 = direct_map[index];   // read the offset stored at the current pixel
        direct_map[index] = -1;       // mark the current pixel as visited
        index = index2;               // jump to the next pixel (dependent load)
    }
}

direct_map is a 683*1024 linear matrix, and each pixel holds an offset value used to access another pixel. index and index2 are not contiguous addresses.

This kernel function takes about 600 microseconds.

But if I delete the line

direct_map[index] = -1;

it takes just 27 microseconds.

I think the value of direct_map[index] has already been read from global memory by

index2 = direct_map[index];

so it should be sitting in the L2 cache. Therefore, the write "direct_map[index] = -1;" should be fast.

I also tested random writes to global memory (test2()). It takes about 120 microseconds.


__global__ void test2(int *direct_map)
{
    int index = 10;

    for (int j = 0; j < 1024; j++)
    {
        direct_map[index] = -1;      // write to the current location
        index = j*683 + j/3 - 1;     // compute the next (scattered) address
    }
}

So I don't know why test1() takes more than 600 microseconds. Thank you.


1 Answer


When you delete the code line:

direct_map[index] = -1; 

your kernel isn't doing anything useful. The compiler can recognize this and eliminate most of the code associated with the kernel launch. With that line removed, the kernel no longer affects any global state, so from the compiler's perspective the code is effectively dead and can be optimized away.
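A common way to keep a timing kernel from being optimized away is to make its final result visible in global memory, so the compiler cannot prove the loads are dead. Here is a minimal sketch of that idea; the extra `result` output parameter and the kernel name `test1_timed` are assumptions for illustration, not part of the original code:

```cuda
// Sketch: the same pointer-chasing loop as test1(), but without the store
// inside the loop, and with the final index written to a global output.
// Because "index" feeds the store at the end, the compiler cannot
// eliminate the chain of dependent loads.
__global__ void test1_timed(int *direct_map, int *result)
{
    int index = 10;

    for (int j = 0; j < 1024; j++)
    {
        index = direct_map[index];  // each load depends on the previous one
    }

    *result = index;                // consume the result in global memory
}
```

This lets you time the reads alone while still giving the compiler a reason to keep them.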

You can verify this by dumping the assembly code that the compiler generates in each case, for example with cuobjdump -sass myexecutable.
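For reference, a typical invocation looks like the following; `myexecutable` and `test.cu` are placeholder names for your own program and source file:

```shell
# Dump the compiled SASS (device machine code) for every kernel in the binary.
cuobjdump -sass myexecutable

# Alternatively, emit the intermediate PTX at compile time and inspect it.
nvcc -ptx test.cu -o test.ptx
```

Comparing the SASS with and without the store makes the dead-code elimination easy to see: the loop body of the stripped-down kernel largely disappears.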

Anytime you make a small change to the code and see a large change in timing, you should suspect that the change you made has allowed the compiler to make different optimization decisions.