CUDA Matrix Addition Timings, By Row Vs. By Column

Question

I am currently learning CUDA, and am working through some exercises. One of them is to implement kernels that add matrices in 3 different ways: 1 thread per element, 1 thread per row, and 1 thread per column. The matrices are square, and are implemented as 1D vectors, that I simply index into with

A[N*row + col]

Intuitively, I expected the first option to be the slowest due to thread overhead, the second to be the fastest since a single thread would be working on adjacent data.

On the CPU, with dense matrices of 8000 x 8000 I get:

Adding on CPU - Adding down columns
Compute Time Taken: 2.21e+00 s
Adding on CPU - Adding across rows
Compute Time Taken: 2.52e-01 s

So about an order of magnitude speed up due to many more cache hits. On the GPU with the same matrices I get:

Adding one element per thread 
Compute Time Taken: 7.42e-05 s
Adding one row per thread 
Compute Time Taken: 2.52e-05 s
Adding one column per thread 
Compute Time Taken: 1.57e-05 s

Which in non-intuitive to me. The 30-40% speed up for the last case is consistent above about 1000 x 1000 matrices. Note that these timings are only the kernel execution, and don't include the data transfer between host and device. Below are my two kernels for comparison.

__global__
void matAddKernel2(float* A, float* B, float* C, int N)
{
        int row = threadIdx.x + blockDim.x * blockIdx.x;
        if (row < N)
        {
                int j;
                for (j = 0; j < N; j++)
                {
                        C[N*row + j] = A[N*row + j] + B[N*row + j];
                }
        }
}



__global__
void matAddKernel3(float* A, float* B, float* C, int N)
{
        int col = threadIdx.x + blockDim.x * blockIdx.x;
        int j;

        if (col < N)
        {
                for (j = 0; j < N; j++)
                {
                        C[col + N*j] = A[col + N*j] + B[col + N*j];
                }
        }
}

My question is, why don't GPU threads seem to benefit from working on adjacent data, which would then help it to get more cache hits?

RoBiK RoBiK · Accepted Answer · 2013-05-31T20:39:58

GPU threads do benefit from working on adjacent data, what you are missing is that GPU threads are not independent threads like CPU thread, they work in a group called warp. A warp groups together 32 threads and works in a similar fashion like a single CPU thread executing SIMD instructions with width 32.

So in reality the code that uses one thread per column is the most effective because adjacent threads inside a warp are accessing adjacent data locations from memory and that is the most effective way to access global memory.

You will find the details in the CUDA documentation.

CUDA Matrix Addition Timings, By Row Vs. By Column

1 Answers