I have a cuda kernel which works fine when called from a single CPU threads. However when the same is called from multiple CPU threads (~100), most of the kernel seems not be executed at all as the results comes out to be all zeros.Can someone please guide me how to resolve this problem?
In the current version of kernel I am using a cudadevicesynchronize() at the end of kernel call. Will adding a sync command before cudaMalloc() and kernel call be of any help in this case?
There is another thing which need some clarification. i.e. If two CPU threads executes the same cudaMalloc() command, will the later overwrite the former in GPU memory or will they create their own memory?
Thanks in advance for your help
cudaMalloc()
command. A pointer should be passed tocudaMalloc
only once, before it is passed tocudaFree()
– Robert Crovella