I am currently learning CUDA, and am working through some exercises. One of them is to implement kernels that add matrices in 3 different ways: 1 thread per element, 1 thread per row, and 1 thread per column. The matrices are square, and are implemented as 1D vectors that I simply index into with
A[N*row + col]
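For reference, the host-side setup looks roughly like this (a minimal sketch; the allocation style and fill value are just illustrative):

// N x N matrix stored as a flat, row-major array.
float* A = (float*)malloc(N * N * sizeof(float));
for (int row = 0; row < N; row++)
{
    for (int col = 0; col < N; col++)
    {
        A[N*row + col] = 1.0f;  // element (row, col)
    }
}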
Intuitively, I expected the first option to be the slowest due to thread overhead, and the second to be the fastest, since each thread would be working on adjacent data.
On the CPU, with dense matrices of 8000 x 8000 I get:
Adding on CPU - Adding down columns
Compute Time Taken: 2.21e+00 s
Adding on CPU - Adding across rows
Compute Time Taken: 2.52e-01 s
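The CPU version is just two nested loops over the same flat arrays; roughly what each case times is the following (a sketch, with the timing code left out):

// "Adding down columns": the inner loop jumps N floats per iteration.
for (int col = 0; col < N; col++)
{
    for (int row = 0; row < N; row++)
    {
        C[N*row + col] = A[N*row + col] + B[N*row + col];
    }
}

// "Adding across rows": the inner loop walks contiguous memory.
for (int row = 0; row < N; row++)
{
    for (int col = 0; col < N; col++)
    {
        C[N*row + col] = A[N*row + col] + B[N*row + col];
    }
}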
So about an order of magnitude speed-up, due to many more cache hits when walking across rows. On the GPU with the same matrices I get:
Adding one element per thread
Compute Time Taken: 7.42e-05 s
Adding one row per thread
Compute Time Taken: 2.52e-05 s
Adding one column per thread
Compute Time Taken: 1.57e-05 s
Which is non-intuitive to me. The 30-40% speed-up for the last case is consistent above roughly 1000 x 1000 matrices. Note that these timings are only the kernel execution and don't include the data transfer between host and device (a sketch of the timing code follows the kernels). Below are my two kernels for comparison.
__global__
void matAddKernel2(float* A, float* B, float* C, int N)
{
    // One thread per row: each thread walks the whole of its row,
    // so its successive accesses are to adjacent elements.
    int row = threadIdx.x + blockDim.x * blockIdx.x;
    if (row < N)
    {
        for (int j = 0; j < N; j++)
        {
            C[N*row + j] = A[N*row + j] + B[N*row + j];
        }
    }
}
__global__
void matAddKernel3(float* A, float* B, float* C, int N)
{
    // One thread per column: each thread walks down its column,
    // so its successive accesses are N elements apart.
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    if (col < N)
    {
        for (int j = 0; j < N; j++)
        {
            C[col + N*j] = A[col + N*j] + B[col + N*j];
        }
    }
}
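For completeness, here is roughly how I time just the kernel with CUDA events (a sketch; the block size of 256 and the device pointer names d_A, d_B, d_C are illustrative):

// Time only the kernel launch; host<->device copies are outside the events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

int blockSize = 256;                              // one thread per row/column
int numBlocks = (N + blockSize - 1) / blockSize;

cudaEventRecord(start);
matAddKernel3<<<numBlocks, blockSize>>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);           // elapsed time in milliseconds
printf("Compute Time Taken: %.2e s\n", ms / 1000.0f);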
My question is: why don't GPU threads seem to benefit from working on adjacent data, which would then help them get more cache hits?