CUDA matrix multiplication gives wrong answer

Question

Update!

My current code doesn't check for out-of-bounds memory access. When I run cuda-memcheck, it reports invalid memory accesses even for matrices of just 2 by 2! Somehow I'm accessing memory that I shouldn't, and that's the problem!

To check for out-of-bounds memory access, run cuda-memcheck ./(insert executable here)

Shown below are my kernel launch and the kernel code for the matrix multiplication itself:

dim3 block(32,32);
dim3 grid( (n+31)/32, (n+31)/32 );
matrixMul<<<grid,block>>>(d_C, d_A, d_B, n, k);

kA and kB are the input matrices (every element is 2 to make the results easier to check).

m, n, and k are all the same number, since my matrices are square.

kC is the matrix that stores the answer.

#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_

#include <stdio.h>

__global__ void matrixMul(float *kC, float *kA, float *kB, int n, int k)
{

    int tx = blockIdx.x * 32 + threadIdx.x;
    int ty = blockIdx.y * 32 + threadIdx.y;
    float value = 0;

    for (int i=0;i<n;i++)
    {
        float elementA=kA[ty*n+i];
        float elementB=kB[i*k+tx];
        value += elementA*elementB;
    }

    kC[ty*n+tx] = value;
}

#endif // #ifndef _MATRIXMUL_KERNEL_H_
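
Because every input element is 2 and the matrices are n by n, each element of the correct result should be 2 * 2 * n = 4n. A minimal host-side sketch of a reference check follows, assuming row-major storage as in the kernel. The names referenceMatMul and allEqual are hypothetical helpers, not part of the original code, and the sketch assumes small matrices where 4n is exactly representable as a float.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): CPU reference for C = A * B.
// Uses the same row-major indexing as the kernel: A is n x n, B is n x k.
// C is stored with row stride k, which matches the kernel's stride n when n == k.
std::vector<float> referenceMatMul(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   int n, int k)
{
    assert(n > 0 && k > 0);
    assert(A.size() == static_cast<std::size_t>(n) * n);
    assert(B.size() == static_cast<std::size_t>(n) * k);

    std::vector<float> C(static_cast<std::size_t>(n) * k, 0.0f);
    for (int row = 0; row < n; ++row) {
        for (int col = 0; col < k; ++col) {
            float value = 0.0f;
            for (int i = 0; i < n; ++i) {
                value += A[row * n + i] * B[i * k + col];
            }
            C[row * k + col] = value;
        }
    }
    return C;
}

// Hypothetical helper: true if every element of C equals expected.
bool allEqual(const std::vector<float>& C, float expected)
{
    for (float c : C) {
        if (c != expected) return false;
    }
    return true;
}
```

Copying kC back to the host and comparing it against this reference (or simply against 4n with allEqual) shows which rows of the GPU result are wrong.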

Where is the definition of the variable value ? Are you initializing it to zero before the for loop? This code doesn't look like it would compile. — Robert Crovella
@Robert Crovella my bad, I wrote the wrong variable name when I moved the code. It should be "float value" and not "float storage." I have "float value" in my program though. Do you think it might be my memory if it works all the way up to 10x10's? After 10x10's the values double for some rows of the matrix or sometimes something else... — Mechy
This kernel has no thread checks like if ((ty < n) && (tx < k)){. So you should show your kernel invocation as well, and probably the data copy portion of the host code as well. — Robert Crovella

Unknown Unknown · Accepted Answer · 2013-05-19T03:16:09

Based on how you are defining the grid of threads, you should add a thread check to the kernel code like this:

#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_

#include <stdio.h>

__global__ void matrixMul(float *kC, float *kA, float *kB, int n, int k)
{

    int tx = blockIdx.x * 32 + threadIdx.x;
    int ty = blockIdx.y * 32 + threadIdx.y;

    if ((ty < n) && (tx < n)) { // add this line (tx < n is enough here because n == k)
      float value = 0;

      for (int i=0;i<n;i++)
      {
        float elementA=kA[ty*n+i];
        float elementB=kB[i*k+tx];
        value += elementA*elementB;
      }

      kC[ty*n+tx] = value;
    }  //  add this line
}

#endif // #ifndef _MATRIXMUL_KERNEL_H_

Otherwise, threads outside the valid array bounds will corrupt your results. Things work when the matrix dimension is a multiple of 32 because there are no invalid threads: in that case you're launching exactly the required number of threads. In all other cases you are launching extra threads. These extra threads, if allowed to compute an invalid matrix position, will corrupt the results. For example, a thread with ty < n but tx >= n writes to kC[ty*n+tx], which lands in a later row of the matrix or past the end of the allocation.
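
To make the number of extra threads concrete, here is a small host-side sketch of the launch arithmetic. It mirrors the dim3 block(32,32) and dim3 grid((n+31)/32, (n+31)/32) configuration from the question. The function names are hypothetical and used only for illustration.

```cpp
#include <cassert>

// Hypothetical helpers (not from the original post) that mirror the launch
// configuration: dim3 block(32,32); dim3 grid((n+31)/32, (n+31)/32);

// Blocks needed per dimension: ceiling of n / tile.
int blocksPerDim(int n, int tile)
{
    assert(n > 0 && tile > 0);
    return (n + tile - 1) / tile;
}

// Total threads launched for an n x n matrix with tile x tile blocks.
long launchedThreads(int n, int tile)
{
    long perDim = static_cast<long>(blocksPerDim(n, tile)) * tile;
    return perDim * perDim;
}

// Threads that fall outside the n x n matrix and must be masked by the guard.
long extraThreads(int n, int tile)
{
    return launchedThreads(n, tile) - static_cast<long>(n) * n;
}
```

With tile = 32, a 2 by 2 matrix launches 1024 threads, of which 1020 are extra. A 32 by 32 matrix has no extra threads, and a 33 by 33 matrix launches 4096 threads for only 1089 valid elements.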
