My python code has a gpu kernel function which is called multiple times in a for loop from host like this :
for i in range:
gpu_kernel_func(blocksize, grid)
Since this function call requires communication between host and gpu device multiple times which is not efficient, I want to make this as
gpu_kernel_function(){
for(){
computation } ;
}
But this requires extra step to make sure all the blocks in grid are in sync. According to dynamic parallelism, calling a dummy child kernel should ensure that every thread (in whole grid) should finish that child kernel before the code continues running. So I defined another kernel just like gpu_kernel_function and I tried this :
GPUcode = '''
\__global__ gpu_kernel_function() {... }
\__global__ dummy_child_kernel(){ ... }
'''
gpu_kernel_function(){
for() {
computation } ;
dummy_child_kernel(void);
}
But I am getting this error " nvcc fatal : Option '--cubin (-cubin)' is not allowed when compiling for a virtual compute architecture "
I am using Tesla P100 (compute 6.0), python 3.5, cuda.8.0.44. I am compiling my sourcemodule like this :
mod = SourceModule(GPUcode, options=['-rdc=true' ,'-lcudart','-lcudadevrt','--machine=64'],arch='compute_60' )
I tried compute_35 too which gives same error.