I am trying to load a 3D array into PyCUDA (I'm going to load images). I want each thread to handle all the channels of a single pixel using a for loop (this is an algorithmic requirement).
So far I have this working:
import numpy as np
import pycuda.autoinit  # initializes the CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule(open("./cudacode.cu").read())
multiply_them = mod.get_function("multiply_them")

rows, cols = 800, 400
a = np.random.randn(rows, cols, 3).astype(np.float32)
b = np.random.randn(rows, cols, 3).astype(np.float32)
blocks = 16  # threads per block along each axis
dest = np.zeros_like(a)

multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b), np.int32(rows), np.int32(cols),  # notice it is rows, cols
    block=(blocks, blocks, 1), grid=(rows // blocks, cols // blocks))

print(dest - 2 * a * b)
print(np.unique(dest - 2 * a * b))
and my CUDA code (cudacode.cu) is:
__global__ void multiply_them(float *dest, float *a, float *b, int cols, int rows)  // notice it is cols, rows and not rows, cols
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows && col < cols) {
        dest[3*(row*cols + col)    ] = 2.0f * a[3*(row*cols + col)    ] * b[3*(row*cols + col)    ];  // first channel of some pixel
        dest[3*(row*cols + col) + 1] = 2.0f * a[3*(row*cols + col) + 1] * b[3*(row*cols + col) + 1];  // second channel of that pixel
        dest[3*(row*cols + col) + 2] = 2.0f * a[3*(row*cols + col) + 2] * b[3*(row*cols + col) + 2];  // third channel of that pixel
    }
}
Notice that in my CUDA function, the rows and columns are switched. The whole code works fine the way it is. I can tell that it works because print(np.unique(dest - 2*a*b)) prints 0.0.
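To convince myself this isn't an accident, I also wrote a small CPU-only NumPy sketch (my own illustration, not part of the PyCUDA code above) checking that the swapped indexing still visits every element of the buffer exactly once. That is all an elementwise kernel needs, since dest, a, and b share the same C-contiguous layout:

```python
import numpy as np

rows, cols = 8, 4  # same kind of non-square shape, just smaller
a = np.random.randn(rows, cols, 3).astype(np.float32)

# Inside the kernel the arguments are swapped: its "cols" receives the
# caller's rows, and its "rows" receives the caller's cols. Replay the
# kernel's index arithmetic with those swapped values:
covered = set()
for r in range(cols):           # kernel's "rows" = caller's cols
    for c in range(rows):       # kernel's "cols" = caller's rows
        base = 3 * (r * rows + c)  # kernel's 3*(row*cols + col)
        covered.update((base, base + 1, base + 2))

# Every flat element of the (rows, cols, 3) buffer is touched exactly once,
# so an elementwise operation still produces the right result.
assert covered == set(range(a.size))
```

So the swapped arguments relabel which thread handles which pixel, but no element is missed or duplicated.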
I found the dest[3*(row*cols + col)] style of indexing here: https://wiki.tiker.net/PyCuda/Examples/GlInterop
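For reference, that flat index matches NumPy's own C-contiguous layout for a (rows, cols, 3) array. A quick sketch to confirm (my own check, not from the linked example):

```python
import numpy as np

rows, cols = 4, 3
a = np.arange(rows * cols * 3, dtype=np.float32).reshape(rows, cols, 3)
flat = a.ravel()  # the 1-D view of the buffer, as the kernel sees it

# flat index 3*(row*cols + col) + channel picks out a[row, col, channel]
for row in range(rows):
    for col in range(cols):
        for ch in range(3):
            assert flat[3 * (row * cols + col) + ch] == a[row, col, ch]
```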
My question is: why do rows and cols need to be switched in order for this to work?
And what is the more logically correct way to do what I want?
I am new to CUDA, so please be patient, as I may ask questions that will probably sound very stupid.