
I wrote some pretty simple GPU code here in CUDA C to copy an array, nums, into an array, vals. nums is [4, 7, 1, 9, 2]. This is how I wanted to copy each element over:

#include <iostream>
using namespace std;

__global__ void makeArray(int*);

int main()
{
  int* d_nums;
  int nums[5];

  nums[0] = 4;
  nums[1] = 7;
  nums[2] = 1;
  nums[3] = 9;
  nums[4] = 2;
  cudaMalloc(&d_nums, sizeof(int)*5);

  makeArray<<<2,16>>>(d_nums);

  cudaMemcpy(nums, d_nums, sizeof(int)*5, cudaMemcpyDeviceToHost);

  for (int i = 0; i < 5; i++)
    cout << i << " " << nums[i] << endl;

  return 0;
}

__global__ void makeArray(int* nums)
{
  int vals[5];
  int threadIndex = blockIdx.x * blockDim.x + threadIdx.x;

  vals[threadIndex%5] = nums[threadIndex%5];
  __syncthreads();

  if (threadIndex < 5)
    nums[threadIndex] = vals[threadIndex];
}

In the long run, I want to transfer an array from the CPU to the GPU shared memory using this method, but I can't even get this simple practice file to work. I'm expecting the output to look something like this:

0 4
1 7
2 1
3 9
4 2

But I'm getting this:

0 219545856
1 219546112
2 219546368
3 219546624
4 219546880

My thought process is that by taking the thread index modulo 5, threads whose index is greater than the number of elements in the array still map onto valid slots, so all 5 data points get covered without reading past the end of the array. I can also assign each array slot at the same time, one per thread, and then call __syncthreads() at the end to make sure every thread is done copying. Clearly, that isn't working. Help!

1 Answer


After your edit, we can see that d_nums points to uninitialised memory: you allocated it but never wrote anything into it, so the kernel (and the later device-to-host copy) just reads back garbage. If you want your data accessible to the GPU, you have to copy it over first:

cudaMemcpy(d_nums, nums, sizeof(nums), cudaMemcpyHostToDevice);

before you run the kernel.
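
Putting it together, the host side would look something like the sketch below (error checking omitted, and the kernel reduced to a straight guarded copy for clarity — the per-thread vals array in your version is local to each thread, not shared, so it doesn't add anything here):

```cuda
#include <iostream>
#include <cuda_runtime.h>

__global__ void makeArray(int* nums)
{
    int threadIndex = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard: the launch creates 32 threads but the array has only 5 elements.
    if (threadIndex < 5)
        nums[threadIndex] = nums[threadIndex];  // placeholder for per-element work
}

int main()
{
    int nums[5] = {4, 7, 1, 9, 2};
    int* d_nums;

    cudaMalloc(&d_nums, sizeof(nums));

    // Copy the host data to the device BEFORE launching the kernel;
    // without this, the kernel operates on uninitialised device memory.
    cudaMemcpy(d_nums, nums, sizeof(nums), cudaMemcpyHostToDevice);

    makeArray<<<2, 16>>>(d_nums);

    cudaMemcpy(nums, d_nums, sizeof(nums), cudaMemcpyDeviceToHost);

    for (int i = 0; i < 5; i++)
        std::cout << i << " " << nums[i] << std::endl;

    cudaFree(d_nums);
    return 0;
}
```

With the host-to-device copy in place, the loop prints the original values (0 4, 1 7, and so on), which is the output you were expecting.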