I am trying to write a program for matrix calculations using C/CUDA. I have the following program:
In main.cu
#include <cuda.h>
#include <iostream>
#include "teste.cuh"

using std::cout;

int main(void)
{
    const int Ndofs = 2;
    const int Nel = 4;

    double *Gh = new double[Ndofs*Nel*Ndofs*Nel];
    double *Gg;
    cudaMalloc((void**)&Gg, sizeof(double)*Ndofs*Nel*Ndofs*Nel);

    for (int ii = 0; ii < Ndofs*Nel*Ndofs*Nel; ii++)
        Gh[ii] = 0.;

    cudaMemcpy(Gh, Gg, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyHostToDevice);

    integraG<<<256, 256>>>(Nel, Gg);

    cudaMemcpy(Gg, Gh, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyDeviceToHost);

    for (int ii = 0; ii < Ndofs*Nel*Ndofs*Nel; ii++)
        cout << ii + 1 << " " << Gh[ii] << "\n";

    return 0;
}
In teste.cuh
#ifndef TESTE_CUH_
#define TESTE_CUH_

__global__ void integraG(const int N, double* G)
{
    const int szmodel = 2*N;

    int idx = threadIdx.x + blockIdx.x*blockDim.x;
    int idy = threadIdx.y + blockIdx.y*blockDim.y;
    int offset = idx + idy*blockDim.x*gridDim.x;
    int posInit = szmodel*offset;

    G[posInit + 0] = 1;
    G[posInit + 1] = 1;
    G[posInit + 2] = 1;
    G[posInit + 3] = 1;
}

#endif
The result (which is supposed to be a matrix filled with 1s) is copied back to the host array. The problem is: nothing happens! Apparently my program is not even calling the GPU kernel, and I am still getting an array full of zeros.
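In case it's relevant, this is roughly how I would add error checking around the runtime calls and the kernel launch. It is only a minimal sketch using the standard cudaGetLastError / cudaGetErrorString / cudaDeviceSynchronize API; the CHECK macro name and the tiny test in main are my own, not from the book:

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    // Minimal error-checking helper (my own naming): wraps a CUDA runtime
    // call and aborts with a readable message if it did not succeed.
    #define CHECK(call)                                                     \
        do {                                                                \
            cudaError_t err = (call);                                       \
            if (err != cudaSuccess) {                                       \
                fprintf(stderr, "CUDA error: %s at %s:%d\n",                \
                        cudaGetErrorString(err), __FILE__, __LINE__);       \
                exit(EXIT_FAILURE);                                         \
            }                                                               \
        } while (0)

    // Around a kernel launch one would use it like this:
    //     integraG<<<grid, block>>>(Nel, Gg);
    //     CHECK(cudaGetLastError());        // reports launch-configuration errors
    //     CHECK(cudaDeviceSynchronize());   // reports errors raised while the kernel runs

    int main()
    {
        double *d = nullptr;
        CHECK(cudaMalloc((void**)&d, 16 * sizeof(double)));  // fails loudly if allocation fails
        CHECK(cudaMemset(d, 0, 16 * sizeof(double)));
        CHECK(cudaFree(d));
        printf("runtime calls succeeded\n");
        return 0;
    }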
I am very new to CUDA programming, and I am using *CUDA by Example* (Jason Sanders) as a reference book.
My questions are:
- What is wrong with my code?
- Is this the best way to deal with matrices on the GPU, i.e. storing them in vectorized (flattened) form? (A sketch of the indexing I mean is shown after this list.)
- Is there another reference that provides more examples of working with matrices on GPUs?
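For context, this is what I mean by "vectorized form": the matrix lives in one flat array and element (i, j) is addressed with row-major arithmetic. The snippet below is just my own illustration of that indexing, not code from the book:

    #include <cstdio>

    int main()
    {
        // A (rows x cols) matrix stored as a single flat array.
        // Element (i, j) lives at A[i*cols + j] (row-major order).
        const int rows = 2, cols = 3;
        double A[rows*cols];

        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                A[i*cols + j] = 10*i + j;   // store a value derived from (i, j)

        printf("%f\n", A[1*cols + 2]);      // prints 12.000000, i.e. element (1, 2)
        return 0;
    }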