
I'm still a beginner with CUDA, and I have been trying to write a simple kernel that performs a parallel prime sieve on the GPU. I originally wrote the code in C, but I wanted to investigate the speed-up on a GPU, so I rewrote it:

41.cu

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define B 1024
#define T 256
#define N (B*T)

#define checkCudaErrors(error) {\
    if (error != cudaSuccess) {\
        printf("CUDA Error - %s:%d: '%s'\n",__FILE__,__LINE__,cudaGetErrorString(error));\
        exit(1);\
        }\
}

__global__ void prime_sieve(int *primes) {
    unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    primes[i] = i;
    primes[0] = primes[1] = 0;
    if (i > 1 && i<N) { 
        for (int j=2; j<N/2; j++) {
            if (i*j < N) {
                primes[i*j] = 0;
            }
        }
    }
}

int main() {
    int *h_primes=(int*)malloc(N * sizeof(int));
    int *d_primes;
    checkCudaErrors(cudaMalloc( (void**)&d_primes, N*sizeof(int)));
    checkCudaErrors(cudaMemcpy(d_primes,h_primes,N*sizeof(int),cudaMemcpyHostToDevice));    
    prime_sieve<<<B,T>>>(d_primes);
    checkCudaErrors(cudaMemcpy(h_primes,d_primes,N*sizeof(int),cudaMemcpyDeviceToHost));
    checkCudaErrors(cudaFree(d_primes));

    int size = 0;
    int total = 0;
    for (int i=2; i<N; i++) {
        if (h_primes[i]) {
            size++;
        }
        total++;
    }
    printf("\n");
    printf("Length = %d\tPrimes = %d\n",total,size);
    free(h_primes);
    return 0;
}

I run the program on Ubuntu 16.04 (4.4.0-83-generic) and compile with nvcc 41.cu -o 41.o -arch=sm_30 using CUDA 8.0.61. The program runs on a GeForce GTX 780 Ti, but every time it runs it produces non-deterministic results:

Length = 262142 Primes = 49477
Length = 262142 Primes = 49486
Length = 262142 Primes = 49596
Length = 262142 Primes = 49589

There were no errors reported back. At first I thought it was a race condition, but cuda-memcheck didn't report any hazards for racecheck, initcheck, or synccheck, and I couldn't find any problem with my assumptions. Could this be a synchronisation problem?
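For reference, the checks amounted to something like this (binary name taken from the compile command above):

# one cuda-memcheck tool per run, all against the same binary
cuda-memcheck --tool racecheck ./41.o
cuda-memcheck --tool initcheck ./41.o
cuda-memcheck --tool synccheck ./41.o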

This non-deterministic behaviour only occurs when I increase the block and thread counts to the values in the code. When I tried a block and thread count of, say, 16, there were no problems (as far as I could tell). It seems that not all threads get a chance to execute? I was planning to run this on very large arrays (< 1 billion integers), but I am stuck at this point.

What am I doing wrong here?


1 Answer


There is a giant race condition

So primes[i] > 0 means that i is prime, while primes[i] == 0 means it is composite.

primes[i] = i; is the first update each thread makes to primes. Keep this in mind.

Now let's see what happens when thread 16 executes. It sets primes[16] = 16 and then zeroes every larger multiple of 16, something like the following:

primes[32] = primes[48] = ... = primes[k*16] = 0

Imagine that thread 48 gets scheduled just after thread 16 has completed its job (or at least once thread 16's loop has passed j = 3).

Thread 48 sets primes[48] = 48. You have lost the update made by thread 16.

That is a race condition. (It also explains why cuda-memcheck stayed silent: the racecheck tool only detects hazards on shared memory, and this race is on global memory.)
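Concretely, here is one schedule that loses the update (a hypothetical interleaving, written as pseudocode; any similar schedule does the same damage):

thread 16: primes[16] = 16;
thread 16: primes[32] = 0;    // j = 2
thread 16: primes[48] = 0;    // j = 3
thread 48: primes[48] = 48;   // overwrites the 0 written by thread 16

primes[48] ends up non-zero, so 48 is counted as a prime. Under a different schedule it stays 0, which is exactly why the counts vary from run to run.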


When coding in CUDA you should make sure that the correctness of your code does not depend on a particular scheduling of warps.

Treat the order of execution as non-deterministic: unless you enforce an ordering yourself (for example with separate kernel launches, as sketched below), any interleaving of threads is possible.
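A minimal fix, sketched below (the kernel names init_primes and mark_composites are my own, and I keep your B/T/N defines): split the work into two kernels. A kernel launch boundary is a grid-wide synchronisation point, so every cell is initialised exactly once before any thread starts zeroing composites. Inside the second kernel the writes still overlap, but they all store the same value (0), so the outcome is deterministic.

// Sketch of one possible fix, not the only one: two launches instead of one.
__global__ void init_primes(int *primes) {
    unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N) {
        primes[i] = (i < 2) ? 0 : i;   // 0 and 1 are never prime
    }
}

__global__ void mark_composites(int *primes) {
    unsigned int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i > 1 && i < N) {
        for (unsigned int j = 2; i * j < N; j++) {
            primes[i * j] = 0;   // overlapping writes all store 0, so no lost update can change the result
        }
    }
}

// In main(), replace the single launch with:
//     init_primes<<<B,T>>>(d_primes);
//     mark_composites<<<B,T>>>(d_primes);
// The host-to-device memcpy of the uninitialised buffer can then be dropped.

This also gets rid of the primes[0] = primes[1] = 0 line that every thread was executing, and no thread ever writes primes[i] = i after another thread may already have zeroed that cell.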