
I am trying to load a 3D array into PyCUDA (I'm going to load images). I want each thread to handle all the channels of a single pixel using a for loop (this is an algorithmic requirement).

So far I have this working:

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule(open("./cudacode.cu").read())

multiply_them = mod.get_function("multiply_them")

rows, cols = 800, 400
a = np.random.randn(rows, cols, 3).astype(np.float32)
b = np.random.randn(rows, cols, 3).astype(np.float32)

blocks = 16

dest = np.zeros_like(a)
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b), np.int32(rows), np.int32(cols),  # notice it is rows, cols
        block=(blocks, blocks, 1), grid=(rows // blocks, cols // blocks))

print(dest - 2 * a * b)
print(np.unique(dest - 2 * a * b))
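(An aside, not part of my original code: the grid above covers every pixel only because 800 and 400 are exact multiples of the block size 16. A common pattern is to size the grid with ceiling division so that odd sizes are still covered, with the kernel's bounds check discarding the excess threads. A minimal sketch:)

```python
def grid_dim(n, block):
    """Number of blocks needed to cover n elements with `block` threads each."""
    return (n + block - 1) // block

# Exact multiple: same result as plain floor division.
assert grid_dim(800, 16) == 50

# Not a multiple: floor division (401 // 16 == 25) would miss the last row,
# ceiling division launches one extra block that the bounds check trims.
assert grid_dim(401, 16) == 26
```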

and my cuda code is :

__global__ void multiply_them(float *dest, float *a, float *b, int cols, int rows)  // notice it is cols, rows and not rows, cols
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < rows && col < cols) {
        dest[3 * (row * cols + col)]     = 2.0 * a[3 * (row * cols + col)]     * b[3 * (row * cols + col)];      // first channel of some pixel
        dest[3 * (row * cols + col) + 1] = 2.0 * a[3 * (row * cols + col) + 1] * b[3 * (row * cols + col) + 1]; // second channel of that pixel
        dest[3 * (row * cols + col) + 2] = 2.0 * a[3 * (row * cols + col) + 2] * b[3 * (row * cols + col) + 2]; // third channel of that pixel
    }
}

Notice that in my CUDA function, the rows and columns are switched. The whole code works fine the way it is.

I can tell my code works because print(np.unique(dest - 2*a*b)) prints 0.0.

The dest[3* (row*cols + col)] style of indexing I found here: https://wiki.tiker.net/PyCuda/Examples/GlInterop

My question is: why do the rows and cols need to be switched in order for this to work?

And what is the more logically correct way to do what I want?

I am new to CUDA, so please be patient; I may ask questions that are probably very stupid.

1 Answer

Numpy arrays use row-major ordering by default. Threads within CUDA blocks are numbered so that threadIdx.x is the fastest-varying dimension and threadIdx.z the slowest (functionally equivalent to column-major ordering). So if you wish to access row-major ordered data, as in a default-ordered numpy array, while maintaining an access pattern that allows memory coalescing, you should "reverse" the use of the x and y dimensions within a block: map the column index to threadIdx.x. If your numpy array were column-major ordered (order='F'), you would not do this.

That is why your code works correctly when your intuition is that it should not.
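You can convince yourself of the layout on the host side with plain numpy, no GPU required (a small sketch; the shape is arbitrary):

```python
import numpy as np

rows, cols = 4, 3
a = np.random.randn(rows, cols, 3).astype(np.float32)  # C (row-major) order by default
flat = a.ravel()  # the 1-D buffer the kernel actually sees

# For a C-ordered (rows, cols, 3) array, element (row, col, ch) lives at
# flat index 3*(row*cols + col) + ch -- the same expression your kernel uses.
row, col, ch = 2, 1, 0
assert flat[3 * (row * cols + col) + ch] == a[row, col, ch]

# Consecutive values of `col` are adjacent in memory (a stride of 3 floats),
# which is why col should come from threadIdx.x for coalesced access.
print(a.strides)  # byte strides: (cols*3*4, 3*4, 4) == (36, 12, 4)
```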