I have been writing a simple CUDA program recently; the kernel function is below:
#define BLOCK_SIZE 16
#define RADIOUS 7
#define SM_SIZE (BLOCK_SIZE + 2 * RADIOUS)

__global__ static void DarkChannelPriorCUDA(const float* r, size_t ldr, const float* g, size_t ldg, const float* b, size_t ldb, float* d, size_t ldd, int n, int m)
{
    // shared tile: the block's 16x16 output region plus a RADIOUS-wide halo on every side
    __shared__ float R[SM_SIZE][SM_SIZE];
    __shared__ float G[SM_SIZE][SM_SIZE];
    __shared__ float B[SM_SIZE][SM_SIZE];

    const int tidr = threadIdx.x;              // row within the block
    const int tidc = threadIdx.y;              // column within the block
    const int bidr = blockIdx.x * BLOCK_SIZE;  // first output row of this block
    const int bidc = blockIdx.y * BLOCK_SIZE;  // first output column of this block

    int i, j, tr, tc;

    // cooperatively load the SM_SIZE x SM_SIZE tile from global memory
    for (i = 0; i < SM_SIZE; i += BLOCK_SIZE)
    {
        tr = bidr - RADIOUS + i + tidr;
        for (j = 0; j < SM_SIZE; j += BLOCK_SIZE)
        {
            tc = bidc - RADIOUS + j + tidc;
            if (tr < 0 || tc < 0 || tr >= n || tc >= m)
            {
                // out-of-range pixels get a huge value so they never win the min
                R[i][j] = 1e20;
                G[i][j] = 1e20;
                B[i][j] = 1e20;
            }
            else
            {
                R[i][j] = r[tr * ldr + tc];
                G[i][j] = g[tr * ldg + tc];
                B[i][j] = b[tr * ldb + tc];
            }
        }
    }
    __syncthreads();

    // each thread takes the minimum over its (2*RADIOUS+1) x (2*RADIOUS+1) window
    float results = 1e20;
    for (i = tidr; i <= tidr + 2 * RADIOUS; i++)
        for (j = tidc; j <= tidc + 2 * RADIOUS; j++)
        {
            results = results < R[i][j] ? results : R[i][j];
            results = results < G[i][j] ? results : G[i][j];
            results = results < B[i][j] ? results : B[i][j];
        }
    d[(tidr + bidr) * ldd + tidc + bidc] = results;
}
This function takes three n*m 2D matrices r, g and b as input and outputs an n*m matrix d. Each element d[i][j] is the minimum value found in r, g and b inside the (2*RADIOUS+1)*(2*RADIOUS+1) window centered at (i, j).
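For clarity, this is the result I expect, written as a plain CPU reference (just a sketch; the function name darkChannelCPU and the assumption that each matrix is stored contiguously with leading dimension m are mine):

#include <math.h>

// uses the same RADIOUS macro as the kernel
void darkChannelCPU(const float* r, const float* g, const float* b, float* d, int n, int m)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
        {
            float result = 1e20f;
            // scan the (2*RADIOUS+1) x (2*RADIOUS+1) window centered at (i, j)
            for (int wi = i - RADIOUS; wi <= i + RADIOUS; wi++)
                for (int wj = j - RADIOUS; wj <= j + RADIOUS; wj++)
                {
                    if (wi < 0 || wj < 0 || wi >= n || wj >= m)
                        continue;            // out-of-range pixels are ignored (treated as +inf)
                    int idx = wi * m + wj;   // plain row-major layout, no pitch
                    result = fminf(result, r[idx]);
                    result = fminf(result, g[idx]);
                    result = fminf(result, b[idx]);
                }
            d[i * m + j] = result;
        }
}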
To speed this up, I use shared memory to stage a small tile of values for each block. Each block has 16*16 threads, and each thread computes the result for one element of matrix d, so shared memory needs to hold (BLOCK_SIZE+2*RADIOUS)*(BLOCK_SIZE+2*RADIOUS) elements of r, g and b.
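For reference, a minimal launch sketch, assuming n and m are multiples of BLOCK_SIZE and each matrix is stored contiguously (so every leading dimension is simply m); the device pointer names d_r, d_g, d_b, d_d are placeholders for buffers allocated with cudaMalloc and filled with cudaMemcpy:

#include <cuda_runtime.h>

dim3 block(BLOCK_SIZE, BLOCK_SIZE);          // threadIdx.x -> row, threadIdx.y -> column
dim3 grid(n / BLOCK_SIZE, m / BLOCK_SIZE);   // blockIdx.x covers the n rows, blockIdx.y the m columns
DarkChannelPriorCUDA<<<grid, block>>>(d_r, m, d_g, m, d_b, m, d_d, m, n, m);
cudaDeviceSynchronize();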
But the result is wrong: the values in the shared-memory arrays R, G and B differ from r, g and b in global memory. It seems the data in global memory is never transferred to shared memory successfully, and I can't understand why this happens.
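A minimal way to see the mismatch from inside the kernel is a device-side printf placed right after __syncthreads(), something like this sketch (the element I pick is arbitrary; if the load were correct, R[RADIOUS][RADIOUS] should mirror the global pixel at (bidr, bidc); requires #include <cstdio> and compute capability 2.0+):

// sketch: placed immediately after __syncthreads() inside the kernel
if (tidr == 0 && tidc == 0 && blockIdx.x == 0 && blockIdx.y == 0)
{
    printf("shared R[%d][%d] = %f, global r[%d][%d] = %f\n",
           RADIOUS, RADIOUS, R[RADIOUS][RADIOUS],
           bidr, bidc, r[bidr * ldr + bidc]);
}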