I can't figure out a way to transpose a non-square matrix using shared memory in CUDA C. (I am new to CUDA C and C.)
On the website:
https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/
an efficient way to transpose a matrix is shown (Coalesced Transpose Via Shared Memory), but it only works for square matrices.
The code is also provided on GitHub (same as in the blog).
On Stack Overflow there is a similar question where TILE_DIM = 16
is set, but with that implementation every thread just copies one element of the matrix to the result matrix.
This is my current implementation:
__global__ void transpose(double* matIn, double* matTran, int n, int m){
    __shared__ double tile[TILE_DIM][TILE_DIM];
    int i_n = blockIdx.x*TILE_DIM + threadIdx.x;
    int i_m = blockIdx.y*TILE_DIM + threadIdx.y; // <- threadIdx.y only between 0 and 7

    // Load matrix into tile.
    // Every thread loads in this case 4 elements into the tile.
    int i;
    for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
        if (i_n < n && (i_m+i) < m){
            tile[threadIdx.y+i][threadIdx.x] = matIn[n*(i_m+i) + i_n];
        } else {
            tile[threadIdx.y+i][threadIdx.x] = -1;
        }
    }
    __syncthreads();

    for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
        if (tile[threadIdx.x][threadIdx.y+i] != -1){ // <- is there a better way?
            if (true){ // <- what should be checked here?
                matTran[n*(i_m+i) + i_n] = tile[threadIdx.x][threadIdx.y+i];
            } else {
                matTran[m*i_n + (i_m+i)] = tile[threadIdx.x][threadIdx.y+i];
            }
        }
    }
}
Here each thread copies 4 elements into the tile, and 4 elements from the tile back into the result matrix.
This is the kernel configuration <<<a, b>>>, where
a: (ceil(n/TILE_DIM), ceil(n/TILE_DIM)) (the operands are cast to double) and
b: (TILE_DIM, BLOCK_ROWS) (-> (32, 8)).
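As an aside, this rounding up can be done with integer arithmetic instead of casting to double. A minimal host-side C sketch (num_tiles is a hypothetical helper name, not code from the question):

```c
#define TILE_DIM 32

/* Number of tiles needed to cover `size` elements; rounds up so that
   sizes not divisible by TILE_DIM still get a (partial) last tile. */
static int num_tiles(int size) {
    return (size + TILE_DIM - 1) / TILE_DIM;
}
```

For example, num_tiles(64) is 2, while num_tiles(65) is 3 because one extra partial tile is needed.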
I am currently using the if(tile[threadIdx.x][threadIdx.y+i] != -1)
statement to determine which thread should copy to the result matrix (there might be another way). As far as I understand, this behaves as follows: within a block, the thread with ThreadIdx (x, y)
copies the data into the tile, and the thread with ThreadIdx (y, x)
copies the data back into the result matrix.
I inserted another if
statement to determine where to copy the data, as there are 2(?) possible destinations depending on the ThreadIdx. Currently true
is inserted there, but I have tried many different things. The best I could come up with was if(threadIdx.x+1 < threadIdx.y+i)
, which transposes a 3x2
matrix successfully.
Can someone please explain what I am missing when writing back into the result matrix? Obviously only one destination is correct. Using
matTran[n*(i_m+i) + i_n] = tile[threadIdx.x][threadIdx.y+i];
as mentioned in the blog should be correct, but I can't figure out why it does not work for non-square matrices.
nx != ny
, presuming that both are divisible by TILE_DIM
? I believe the part where you fill tile
should be the same as with the code you posted, since transposing happens after __syncthreads
. The tile
array simply holds a single tile, which means that if i_n
and i_m
are within the input bounds, it must be filled with correct data without any if
s. Shouldn't the kernel configuration be (n/TILE_DIM, m/TILE_DIM)
? – Groo

(n/TILE_DIM, m/TILE_DIM)
only works for nx
and ny
that are divisible by TILE_DIM
. The kernel configuration (ceil(n/TILE_DIM), ceil(n/TILE_DIM))
ensures that we have more threads available if nx
or ny
are NOT divisible by TILE_DIM
, so it just provides some extra threads, which may never be used. I removed all the if
s as you mentioned and the code did not work as expected: everything besides nx = 32
and ny = 32
failed to transpose the matrix. I just overcomplicated a lot. But it helped! – Luxii