0 votes

I am new to GPU programming (and rather rusty in C), so this might be a rather basic question with an obvious bug in my code. What I am trying to do is take a 2-dimensional array and, for each row, find the sum across its columns. So if I have a 2D array that contains:

0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 2 4 6 8 10 12 14 16 18

I want to get back an array that contains the following:

45
45
90

The code I have so far is not returning the correct output, and I'm not sure why. I'm guessing it is because I am not handling the indexing in the kernel properly. But it could be that I am not using the memory correctly, since I adapted this from an over-simplified 1-dimensional example, and the CUDA Programming Guide (section 3.2.2) makes a rather big and not very well described jump for a beginner between 1- and 2-dimensional arrays.

My incorrect attempt:

#include <stdio.h>
#include <stdlib.h>


// start with a small array to test
#define ROW 3
#define COL 10

__global__ void collapse( int *a, int *c){
    /*
       Sum along the columns for each row of the 2D array.
    */
    int total = 0;
    // Loop to get total, seems wrong for GPUs but I don't know a better way
    for (int i=0; i < COL; i++){
        total = total + a[threadIdx.y + i];
    }
    c[threadIdx.x] = total;

}

int main( void ){
    int array[ROW][COL];      // host copies of a, c
    int c[ROW];
    int *dev_a;      // device copies of a, c (just pointers)
    int *dev_c;

    // get the size of the arrays I will need
    int size_2d = ROW * COL * sizeof(int);
    int size_c = ROW * sizeof(int);

    // Allocate the memory
    cudaMalloc( (void**)&dev_a, size_2d);
    cudaMalloc( (void**)&dev_c, size_c);

    // Populate the 2D array on host with something small and known as a test
    for (int i=0; i < ROW; i++){
        if (i == ROW - 1){
            for (int j=0; j < COL; j++){
                array[i][j] = (j*2);
                printf("%i ", array[i][j]);
            }
        } else {
            for (int j=0; j < COL; j++){
                array[i][j] = j;
                printf("%i ", array[i][j]);
            }
        }
        printf("\n");
    }

    // Copy the memory
    cudaMemcpy( dev_a, array, size_2d, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_c, c, size_c, cudaMemcpyHostToDevice );

    // Run the kernel function
    collapse<<< ROW, COL >>>(dev_a, dev_c);

    // copy the output back to the host
    cudaMemcpy( c, dev_c, size_c, cudaMemcpyDeviceToHost );

    // Print the output
    printf("\n");
    for (int i = 0; i < ROW; i++){
        printf("%i\n", c[i]);
    }

    // Release the memory
    cudaFree( dev_a );
    cudaFree( dev_c );
}

Output:

0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 2 4 6 8 10 12 14 16 18

45
45
45
I never used CUDA before, but looking at the code and the linked doc, I suspect you are either missing that blockDim thing, or perhaps collapse should have a signature collapse(int a[ROW][COL], int c[ROW]) and then use a[threadIdx.x][threadIdx.y] – Gábor Buella
or a[threadIdx.y][threadIdx.x]. I'm not sure which one goes where; I didn't read the whole doc. – Gábor Buella
Also, if it really creates ROW times COL threads, you don't need to loop inside that kernel; it will be called 30 times. Just make one addition in each call. At which point, of course, you meet the question of thread safety... – Gábor Buella
@BuellaGábor Yes, I certainly did miss the blockIdx in my code completely. I initially thought that it would create ROW x COL threads as suggested in your last comment, which led to much confusion, but that is not the case; the launch is <<<number_of_blocks, number_of_threads_per_block>>>. – veda905
Okay, I deleted my answer, since it was wrong and misleading. – Gábor Buella

1 Answer

8 votes

You are correct, it's an indexing issue. Your kernel will generate a correct answer if you replace this:

    total = total + a[threadIdx.y + i];

with this:

    total = total + a[blockIdx.x*COL + i];

and this:

    c[threadIdx.x] = total;

with this:

    c[blockIdx.x] = total;
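
Putting both changes together, the kernel ends up looking like this (a minimal sketch; the launch stays <<<ROW, COL>>> and the rest of your program is unchanged):

    __global__ void collapse( int *a, int *c){
        int total = 0;
        // blockIdx.x selects the row; blockIdx.x*COL is that row's offset
        // in the flattened 1D array that was copied to the device
        for (int i=0; i < COL; i++){
            total = total + a[blockIdx.x*COL + i];
        }
        // one result per block, i.e. per row
        c[blockIdx.x] = total;
    }

Note that every thread in a block still computes and writes the same value; that happens to be harmless here, but see items 5 and 6 below.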

However, there's more to say than that.

  1. Any time you're having trouble with CUDA code, you should use proper CUDA error checking (a minimal error-checking sketch appears just after this list). The second issue above was definitely resulting in a memory access error, and you may have gotten a hint of this with error checking. You should also run your codes with cuda-memcheck, which will do an extra-tight job of bounds checking, and it would definitely catch the out-of-bounds access your kernel was making.

  2. I think you may be confused about the kernel launch syntax: <<<ROW, COL>>>. You may be thinking that this maps to 2D thread coordinates (I'm just guessing, since you used threadIdx.y in a kernel where it has no meaning). However, the first parameter is the number of blocks to be launched, and the second is the number of threads per block. If you provide scalar quantities (as you have) for both of these, you will be launching a 1D grid of 1D threadblocks, and your .y variables won't really be meaningful (for indexing). So one takeaway is that threadIdx.y doesn't do anything useful in this setup (it is always zero).

  3. To fix that, we could make the first change listed at the beginning of this answer. Note that when we launch 3 blocks, each block will have a unique blockIdx.x, so we can use that to select the row, and we multiply it by the "width" (COL) of your array to generate the proper offset into the flattened array.

  4. Since the second parameter is the number of threads per block, your indexing into c also didn't make sense. c only has 3 elements (which is sensible), but each block had 10 threads, and in each block the threads were trying to index into the "first 10" locations in c (each thread in a block has a unique value for threadIdx.x). But beyond the first 3 locations, there is no storage in c.

  5. Now possibly the biggest issue: each thread in a block is doing exactly the same thing in the loop. Your code does not differentiate the behavior of the threads. You can write code that gives the correct answer this way, but it's not sensible from a performance standpoint.

  6. To fix this last issue, the canonical answer is to use a parallel reduction. That's an involved topic, and there are many questions about it here on the SO tag, so I won't try to cover it, but I'll point out that there is a good tutorial here, along with the accompanying CUDA sample code, that you can study. If you want to see a parallel reduction on the matrix rows, for example, you could look at this question/answer. It happens to be performing a max-reduction instead of a sum-reduction, but the differences are minor (a minimal per-row sum-reduction sketch also appears at the end of this answer). You can also use an atomic method as suggested in the other answer, but that is generally not considered a "high-performance" approach, because the throughput of atomic operations is more limited than what is achievable with ordinary CUDA memory bandwidth.
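
As promised in item 1, here is one minimal error-checking sketch (the CHECK macro name is just illustrative) showing how you might wrap the runtime API calls and the kernel launch in your program:

    #define CHECK(call) do {                                          \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            printf("CUDA error: %s at %s:%d\n",                       \
                   cudaGetErrorString(err), __FILE__, __LINE__);      \
            exit(1);                                                  \
        }                                                             \
    } while (0)

    // inside main(), for example:
    CHECK( cudaMalloc( (void**)&dev_a, size_2d) );
    CHECK( cudaMalloc( (void**)&dev_c, size_c) );
    CHECK( cudaMemcpy( dev_a, array, size_2d, cudaMemcpyHostToDevice ) );

    collapse<<< ROW, COL >>>(dev_a, dev_c);
    CHECK( cudaGetLastError() );          // catches launch/configuration errors
    CHECK( cudaDeviceSynchronize() );     // catches errors during kernel execution

    CHECK( cudaMemcpy( c, dev_c, size_c, cudaMemcpyDeviceToHost ) );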

You also seem to be generally confused about the CUDA kernel execution model, so continued reading of the programming guide (that you've already linked) is a good starting point.
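
And to give a concrete flavor of the parallel reduction mentioned in item 6, here is a minimal per-row sum-reduction sketch (the kernel name row_sum and the THREADS constant are just illustrative, and it assumes a power-of-two block size): one block per row, each thread accumulates a strided partial sum of its row, and the block then combines the partials with a shared-memory tree reduction.

    #define THREADS 32   // threads per block; a power of two, independent of COL

    __global__ void row_sum(const int *a, int *c, int cols){
        __shared__ int sdata[THREADS];
        int row = blockIdx.x;
        int tid = threadIdx.x;

        // each thread sums a strided slice of its row
        int partial = 0;
        for (int i = tid; i < cols; i += blockDim.x){
            partial += a[row*cols + i];
        }
        sdata[tid] = partial;
        __syncthreads();

        // tree reduction in shared memory: halve the active threads each step
        for (int s = blockDim.x/2; s > 0; s >>= 1){
            if (tid < s){
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }

        if (tid == 0){
            c[row] = sdata[0];   // one sum per row
        }
    }

    // launched as:  row_sum<<< ROW, THREADS >>>(dev_a, dev_c, COL);

For only 10 columns per row this won't beat the simple loop, but the same pattern scales to rows with thousands of elements, where a single thread looping over a whole row would leave most of the GPU idle.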