2
votes

Some questions about CUDA.

1) I noticed that, in every code sample, non-parallel operations performed in global functions (e.g., the computation of a scalar) are always assigned to a specific thread. For example, in this simple code for a dot product, thread 0 performs the summation:

    __global__ void dot( int *a, int *b, int *c )
    {
        // Shared memory for results of multiplication
        __shared__ int temp[N];
        temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

        // Thread 0 sums the pairwise products
        if( 0 == threadIdx.x )
        {
            int sum = 0;
            for( int i = 0; i < N; i++ )
                sum += temp[i];

            *c = sum;
        }
    }

This is fine by me; however, in some code I wrote I did not specify a thread for the non-parallel operation, and it still works: is it therefore compulsory to specify the thread? In particular, the non-parallel operation I want to perform is the following:

        if (epsilon == 1)
        {
            V[0] = B*(Exp - 1 - b);
        }
        else
        {
            V[0] = B*(Exp - 1 + a);
        }

The various variables were passed as arguments to the global function. And here comes my second question.

2) I computed the value of V[0] with a CUDA program and with a serial program on the CPU, obtaining different results. Naturally I thought the problem in CUDA could be that I did not specify the thread, but even after doing so the result does not change, and it is still (much) greater than the serial one: 6.71201e+22 vs -2908.05. Where could the problem be? The other calculations performed in the global function are the following:

int tid = threadIdx.x;

if ( tid != 0 && tid < N )
{
    // ... various operations which do not involve V or the variables used to compute V[0] ...

    V[tid] = B*(1/(1+alpha[tid]*alpha[tid])*(One_G[tid]*Exp - Cos - alpha[tid]*Sin) + kappa[tid]*Sin);
}

As you can see, my condition avoids the case tid == 0.

3) Finally, a last question: in the sample codes I noticed that, if you want to use on the CPU values allocated and computed in GPU memory, you should first copy those values to the CPU (e.g., with the command cudaMemcpy, specifying cudaMemcpyDeviceToHost). Yet I manage to use those values directly in the main code (on the CPU) without any problem. Could this be a clue that there is something wrong with my GPU (or my installation of CUDA), which also causes the previous odd behaviour?

Thank you for your help.

== Added on the 5th January ==

Sorry for my late reply. Before invoking the kernel, all the arrays to be computed (quite a lot of them) are allocated. In particular, the code for the array involved in my question is:

float * V;
cudaMalloc( (void**)&V, N * sizeof(float) );

At the end of the code I wrote:

float V_ [N];
cudaMemcpy( &V_, V, N * sizeof(float), cudaMemcpyDeviceToHost );

cudaFree(V);

cout << V_[0] << endl;

Thank you again for your attention.

2
Have a look at this NVIDIA whitepaper: developer.download.nvidia.com/assets/cuda/files/… - njuffa
@njuffa Thank you for your link. I read some part of the paper, but my problem is different: I get largely different results, while the paper suggests that the results should still be quite similar. - Pippo
Small differences could be amplified by subsequent computation. It's also possible that the problem is not with numerical differences but a bug in the code. - njuffa
Yes, but my problem is that I tried a really small computation (the one in the question) and the result still does not make sense. About the bug in the code: I'll try with just a trivial implementation of my code, and see what I obtain. Thank you for your help! - Pippo

2 Answers

2
votes

If you don't have any cudaMemcpy in your code, that's exactly the problem. ;-) The GPU is accessing its own memory (the RAM on your graphics card), while the CPU is accessing the RAM on your mainboard. You need to allocate and copy alpha, kappa, One_G and all the other arrays to your GPU first using cudaMemcpy, then run your kernel, and after that copy your results back to the CPU. Also, don't forget to allocate the memory on BOTH sides.
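As a sketch of the usual workflow: the buffer names below (h_alpha, d_alpha, d_V) and the placeholder kernel are hypothetical, standing in for whatever arrays and kernel your code actually uses.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256

__global__ void kernel(const float *alpha, float *V) { /* ... your computation ... */ }

int main()
{
    float h_alpha[N], h_V[N];              // host (CPU) buffers
    for (int i = 0; i < N; i++) h_alpha[i] = 1.0f;

    float *d_alpha, *d_V;                  // device (GPU) buffers
    cudaMalloc((void**)&d_alpha, N * sizeof(float));
    cudaMalloc((void**)&d_V,     N * sizeof(float));

    // 1) copy the inputs host -> device
    cudaMemcpy(d_alpha, h_alpha, N * sizeof(float), cudaMemcpyHostToDevice);

    // 2) launch the kernel on the device copies
    kernel<<<1, N>>>(d_alpha, d_V);

    // 3) copy the results device -> host before using them on the CPU
    cudaMemcpy(h_V, d_V, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_alpha);
    cudaFree(d_V);
    printf("%f\n", h_V[0]);
    return 0;
}
```

Without step 1, the kernel reads whatever happens to be in the uninitialised device memory, which would explain a nonsense result like 6.71201e+22.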

As for the non-parallel stuff: if the result is always the same, all threads will write the same thing, so the outcome is identical, just quite a bit less efficient, since all of them try to access the same resource.

1
votes

Is that the exact code you're using? Regarding question 1, you should have a __syncthreads() after the assignment to your shared memory, temp. Otherwise you'll get a race condition where thread 0 can start the summation before temp is fully populated.
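That is, the dot-product kernel from the question would become:

```cuda
__global__ void dot( int *a, int *b, int *c )
{
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

    // Barrier: wait until every thread has written its product
    // before thread 0 starts reading temp[]
    __syncthreads();

    if( 0 == threadIdx.x )
    {
        int sum = 0;
        for( int i = 0; i < N; i++ )
            sum += temp[i];

        *c = sum;
    }
}
```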

As for your other question about specifying the thread, if you have

if (epsilon == 1)
{
    V[0] = B*(Exp - 1 - b);
}
else
{
    V[0] = B*(Exp - 1 + a);
}

Then every thread will execute that code; for example, if you have X threads executing and epsilon is 1 for all of them, then all X threads will evaluate the same line:

V[0] = B*(Exp - 1 - b);

and hence you'll have another race condition, as all X threads will be writing to V[0]. If all the threads have the same value for B*(Exp - 1 - b), then you might not notice a difference, while if they have different values then you're liable to get different results each time, depending on the order in which the threads arrive.
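One way to avoid that race, following the pattern from the dot-product sample, is to let a single thread perform the scalar write (variable names here follow the question; this is a sketch, not the asker's actual code):

```cuda
// Only thread 0 computes and writes the scalar result
if (threadIdx.x == 0)
{
    if (epsilon == 1)
        V[0] = B*(Exp - 1 - b);
    else
        V[0] = B*(Exp - 1 + a);
}
```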