3 votes

I am currently learning CUDA and working through some exercises. One of them is to implement kernels that add matrices in three different ways: one thread per element, one thread per row, and one thread per column. The matrices are square and stored as 1D vectors, which I simply index into with

A[N*row + col]
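
For context, the one-element-per-thread variant looks roughly like this (a sketch, not my exact code; it uses a 2D grid with one thread per element, and matAddKernel1 is just a placeholder name):

__global__
void matAddKernel1(float* A, float* B, float* C, int N)
{
        // One thread per element: thread (x, y) handles row y, column x
        int col = threadIdx.x + blockDim.x * blockIdx.x;
        int row = threadIdx.y + blockDim.y * blockIdx.y;
        if (row < N && col < N)
        {
                C[N*row + col] = A[N*row + col] + B[N*row + col];
        }
}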

Intuitively, I expected the first option to be the slowest due to thread overhead, and the second to be the fastest, since a single thread would be working on adjacent data.

On the CPU, with dense 8000 x 8000 matrices, I get:

Adding on CPU - Adding down columns
Compute Time Taken: 2.21e+00 s
Adding on CPU - Adding across rows
Compute Time Taken: 2.52e-01 s
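
The CPU versions are just the two nested-loop orders swapped, roughly like this (function names are placeholders):

// Row-major layout: "across rows" walks memory contiguously,
// while "down columns" strides by N floats on every access.
void addAcrossRows(const float* A, const float* B, float* C, int N)
{
        for (int row = 0; row < N; row++)
                for (int col = 0; col < N; col++)
                        C[N*row + col] = A[N*row + col] + B[N*row + col];
}

void addDownColumns(const float* A, const float* B, float* C, int N)
{
        for (int col = 0; col < N; col++)
                for (int row = 0; row < N; row++)
                        C[N*row + col] = A[N*row + col] + B[N*row + col];
}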

So there is about an order-of-magnitude speed-up, thanks to many more cache hits. On the GPU, with the same matrices, I get:

Adding one element per thread 
Compute Time Taken: 7.42e-05 s
Adding one row per thread 
Compute Time Taken: 2.52e-05 s
Adding one column per thread 
Compute Time Taken: 1.57e-05 s

This is non-intuitive to me. The 30-40% speed-up for the last case is consistent for matrices above about 1000 x 1000. Note that these timings cover only the kernel execution and don't include the data transfer between host and device. Below are my two kernels for comparison.

__global__
void matAddKernel2(float* A, float* B, float* C, int N)
{
        // One thread per row: each thread walks across its entire row
        int row = threadIdx.x + blockDim.x * blockIdx.x;
        if (row < N)
        {
                for (int j = 0; j < N; j++)
                {
                        C[N*row + j] = A[N*row + j] + B[N*row + j];
                }
        }
}



__global__
void matAddKernel3(float* A, float* B, float* C, int N)
{
        // One thread per column: each thread walks down its entire column
        int col = threadIdx.x + blockDim.x * blockIdx.x;
        if (col < N)
        {
                for (int j = 0; j < N; j++)
                {
                        C[col + N*j] = A[col + N*j] + B[col + N*j];
                }
        }
}
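
For reference, I time only the kernel with CUDA events, roughly like this (the 256-thread block size and the d_A/d_B/d_C device pointers are illustrative):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

int threads = 256;
int blocks = (N + threads - 1) / threads;   // one thread per row (or per column)

cudaEventRecord(start);
matAddKernel3<<<blocks, threads>>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);     // elapsed kernel time in milliseconds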

My question is: why don't GPU threads seem to benefit from working on adjacent data, which would help them get more cache hits?


1 Answer

5 votes

GPU threads do benefit from working on adjacent data; what you are missing is that GPU threads are not independent like CPU threads. They execute in groups of 32 called warps, and a warp works much like a single CPU thread executing SIMD instructions with a width of 32.

So in reality the one-thread-per-column code is the most efficient: adjacent threads within a warp access adjacent memory locations, and that is the most efficient way to access global memory, because the hardware can coalesce those accesses into a single transaction.
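
To make that concrete: on the first loop iteration (j = 0), thread t of a warp reads A[N*t] in the row kernel but A[t] in the column kernel. A small host-side snippet (illustration only) that prints the resulting byte offsets:

#include <cstdio>

int main()
{
        const int N = 8000;
        for (int t = 0; t < 4; t++)
        {
                // j = 0: matAddKernel2's thread t reads A[N*t]; matAddKernel3's thread t reads A[t]
                printf("thread %d: row kernel offset %zu bytes, column kernel offset %zu bytes\n",
                       t, sizeof(float) * (size_t)N * t, sizeof(float) * (size_t)t);
        }
        return 0;
}

Neighboring threads in the column kernel read consecutive 4-byte floats, which the hardware coalesces into one transaction per warp; in the row kernel they are 32000 bytes apart, so each thread's access becomes its own transaction.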

You will find the details in the CUDA C Programming Guide, in the section on coalesced access to global memory.