Update!
My current code doesn't check for out of bounds memory access. When I run the cuda memcheck, it says memory access is bad even for matrices of just 2 by 2! I'm accessing memory where I shouldn't somehow and that's the problem!
To check for out of bounds memory access, run cuda-memcheck ./(insert executable here)
Shown below is my code for the matrix multiplication itself:
dim3 block(32,32);
dim3 grid( (n+31)/32, (n+31)/32 );
matrixMul<<<grid,block>>>(d_C, d_A, d_B, n, k);
kA and kB are matrices with values in them (they're all 2's to make it easier).
m, n, k are all the same number for my square matrices
kC is the matrix to store the answer.
#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_
#include <stdio.h>
__global__ void matrixMul(float *kC, float *kA, float *kB, int n, int k)
{
int tx = blockIdx.x * 32 + threadIdx.x;
int ty = blockIdx.y * 32 + threadIdx.y;
float value = 0;
for (int i=0;i<n;i++)
{
float elementA=kA[ty*n+i];
float elementB=kB[i*k+tx];
value += elementA*elementB;
}
kC[ty*n+tx] = value;
}
#endif // #ifndef _MATRIXMUL_KERNEL_H_
value
? Are you initializing it to zero before the for loop? This code doesn't look like it would compile. – Robert Crovellaif ((ty < n) && (tx < k)){
. So you should show your kernel invocation as well, and probably the data copy portion of the host code as well. – Robert Crovella