My intention is to use n host threads to create n streams concurrently on an NVIDIA Tesla C2050. The kernel is a simple vector multiplication. I am dividing the data equally among the n streams, and each stream would have concurrent kernel execution and data transfer going on.
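To make the "divide the data equally among n streams" step concrete, here is a minimal host-side sketch in plain C++ with no CUDA calls. The names `Chunk` and `splitEvenly` are hypothetical and not part of the SDK sample; the assumption is that each stream gets one contiguous slice, with any remainder spread over the first few slices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: one contiguous slice of the input per stream.
struct Chunk {
    std::size_t offset;  // index of the first element in this slice
    std::size_t count;   // number of elements in this slice
};

// Split dataN elements into nStreams contiguous chunks whose sizes
// differ by at most one element.
std::vector<Chunk> splitEvenly(std::size_t dataN, std::size_t nStreams) {
    assert(nStreams > 0);
    std::vector<Chunk> chunks;
    chunks.reserve(nStreams);
    const std::size_t base = dataN / nStreams;
    const std::size_t extra = dataN % nStreams;  // first `extra` chunks get one more
    std::size_t offset = 0;
    for (std::size_t i = 0; i < nStreams; ++i) {
        const std::size_t count = base + (i < extra ? 1 : 0);
        chunks.push_back({offset, count});
        offset += count;
    }
    return chunks;
}
```

Each chunk's `offset` and `count` could then seed the per-thread plan (the host pointer and `dataN` in my case), so that no two streams touch the same elements.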
The data is floating point. Sometimes the CPU and GPU sums are equal, and sometimes they are far apart. I guess this could be attributed to missing synchronization constructs in my code. However, I don't think any synchronization between streams is necessary, because I want every CPU thread to have a unique stream to control, and I do not care about asynchronous data copy and kernel execution within a thread.
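For reference, this is roughly how the CPU and GPU sums can be compared so that ordinary floating-point rounding is not mistaken for a real mismatch. It is only a sketch: `nearlyEqual` and the tolerance value are assumptions for illustration, not what the SDK sample does.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Returns true when a and b agree to within relTol, measured relative
// to the larger of the two magnitudes.
bool nearlyEqual(double a, double b, double relTol) {
    assert(relTol >= 0.0);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= relTol * scale;
}
```

With a check like `nearlyEqual(sumCPU, sumGPU, 1e-5)`, differences caused by summation order should pass, while the "far apart" results I am seeing would still fail, which points to something other than rounding.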
Following is the code each thread runs:
// Every host thread runs this function concurrently
static CUT_THREADPROC solverThread(TGPUplan *plan)
{
    // Allocate device memory
    cutilSafeCall( cudaMalloc((void**)&plan->d_Data, plan->dataN * sizeof(float)) );

    // Copy input data from the CPU
    cutilSafeCall( cudaMemcpyAsync((void *)plan->d_Data, (void *)plan->h_Data, plan->dataN * sizeof(float), cudaMemcpyHostToDevice, plan->stream) );
    // Wait on the stream to make the cudaMemcpyAsync blocking
    cudaStreamSynchronize( plan->stream );

    // Launch the kernel on this thread's stream
    launch_simpleKernel( plan->d_Data, BLOCK_N, THREAD_N, plan->stream);
    cutilCheckMsg("simpleKernel() execution failed.\n");

    // Read back GPU results
    cutilSafeCall( cudaMemcpyAsync(plan->h_Data, plan->d_Data, plan->dataN * sizeof(float), cudaMemcpyDeviceToHost, plan->stream) );
    // Wait on the stream to make the cudaMemcpyAsync blocking
    cudaStreamSynchronize( plan->stream );

    cutilSafeCall( cudaFree(plan->d_Data) );
}
And this is how the threads are created and the above function is called:
for(i = 0; i < nkernels; i++)
    threadID[i] = cutStartThread((CUT_THREADROUTINE)solverThread, &plan[i]);
printf("main(): waiting for GPU results...\n");
cutWaitForThreads(threadID, nkernels);
I took this strategy from one of the CUDA SDK code samples. As I said above, this code works sometimes, and at other times it gives wayward results. I need help fixing this code.