I have a matrix, and I would like to use CUDA to compute, as fast as possible, the column-wise mean (which boils down to a column-wise sum), i.e., to return a row vector containing the mean of every column of the matrix. A sum reduction implementation for computing the sum of a single column vector looks like this:
template<typename T>
__global__ void kernelSum(const T* __restrict__ input, T* __restrict__ per_block_results, const size_t n) {
    extern __shared__ T sdata[];

    size_t tid = blockIdx.x * blockDim.x + threadIdx.x;

    // load input into __shared__ memory
    T x = 0.0;
    if (tid < n) {
        x = input[tid];
    }
    sdata[threadIdx.x] = x;
    __syncthreads();

    // contiguous range pattern; note that this halving loop only covers
    // every element when blockDim.x is a power of two
    for (int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if (threadIdx.x < offset) {
            // add a partial sum upstream to our own
            sdata[threadIdx.x] += sdata[threadIdx.x + offset];
        }
        // wait until all threads in the block have
        // updated their partial sums
        __syncthreads();
    }

    // thread 0 writes the final result
    if (threadIdx.x == 0) {
        per_block_results[blockIdx.x] = sdata[0];
    }
}
and this is invoked as:
int n = ... // vector size
const int BLOCK_SIZE = 1024;
int number_of_blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;

double* per_block_results = NULL;
cudaMalloc((void**)&per_block_results, sizeof(double) * (number_of_blocks + 1));

// launch one kernel to compute, per block, a partial sum
kernelSum<double><<<number_of_blocks, BLOCK_SIZE, BLOCK_SIZE * sizeof(double)>>>(a, per_block_results, n);

// launch a single block to compute the sum of the partial sums
// (note: this second launch assumes number_of_blocks <= 1024, and the halving loop
//  in the kernel is only exact when number_of_blocks is a power of two)
kernelSum<double><<<1, number_of_blocks, number_of_blocks * sizeof(double)>>>(per_block_results, per_block_results + number_of_blocks, number_of_blocks);
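After the second launch, the scalar sum ends up in per_block_results[number_of_blocks]. A minimal sketch of reading it back and turning the sum into the mean, assuming the variables above, would be:

double sum = 0.0;
cudaMemcpy(&sum, per_block_results + number_of_blocks, sizeof(double), cudaMemcpyDeviceToHost);
double mean = sum / static_cast<double>(n);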
I could generalize this kernel to matrices with any number of columns, but I am limited by shared memory. My GPU has compute capability 3.5, so it has 48 KB of shared memory per block and a maximum block size of 1024 threads. Since I am interested in double precision, shared memory can hold at most 48 * 1024 / 8 = 6144 doubles. Because the reduction is done per block, I can compute the sum reduction for at most 6144 (doubles in shared memory) / 1024 (block size) = 6 columns simultaneously. Reducing the block size would then allow more columns to be processed simultaneously, e.g. 6144 (doubles in shared memory) / 512 (block size) = 12.
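As a sanity check on that arithmetic, the per-block shared-memory budget of such a hypothetical kernel (BLOCK_SIZE and NUM_COLS are just illustrative names, not from my code) would be:

const int BLOCK_SIZE = 512;                                  // threads per block
const int NUM_COLS   = 12;                                   // columns reduced per block
size_t smem_bytes = BLOCK_SIZE * NUM_COLS * sizeof(double);
// 512 * 12 * 8 = 49152 bytes = 48 KB, exactly the per-block limit on compute capability 3.5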
Would this more complex strategy beat the simple approach of looping on the host over every column of the matrix and invoking the sum reduction for each one (sketched below)? Is there yet another, better way to do this?
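By the simple per-column loop I mean something along these lines; this is only a sketch, rows, cols, d_matrix and d_col_sums are illustrative names, and the matrix is assumed to be stored column-major so that each column is contiguous:

const int BLOCK_SIZE = 1024;
int blocks_per_col = (rows + BLOCK_SIZE - 1) / BLOCK_SIZE;
for (int c = 0; c < cols; ++c) {
    // per-block partial sums for column c
    kernelSum<double><<<blocks_per_col, BLOCK_SIZE, BLOCK_SIZE * sizeof(double)>>>(d_matrix + (size_t)c * rows, per_block_results, rows);
    // reduce the partial sums of column c into d_col_sums[c]
    kernelSum<double><<<1, blocks_per_col, blocks_per_col * sizeof(double)>>>(per_block_results, d_col_sums + c, blocks_per_col);
}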
Comments:

You could compute the column sums by multiplying the matrix by a vector of 1's using cublas<t>gemv(). - Vitality

There is of course some redundant work with cublas<t>gemv(), in that you are loading the dummy 1's vector and multiplying the elements of the matrix A by them. But the cuBLAS routines are highly optimized, and it would be interesting to assess whether your own implementation is faster than the "naive" cuBLAS one we are talking about. Perhaps you would be interested in making a comparison with talonmies' solution and posting your results here. - Vitality
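To make the cublas<t>gemv() suggestion concrete, here is a minimal sketch for a column-major m x n matrix; the handle setup, the pre-filled ones vector and the names d_A, d_ones, d_col_means are assumptions for illustration, not part of the discussion above:

#include <cublas_v2.h>

// y = alpha * A^T * x with x = vector of 1's gives the column sums;
// alpha = 1/m turns them directly into column means.
void columnMeans(cublasHandle_t handle, const double* d_A, int m, int n,
                 const double* d_ones,     // length m, pre-filled with 1.0
                 double* d_col_means) {    // length n, result
    const double alpha = 1.0 / m;
    const double beta  = 0.0;
    cublasDgemv(handle, CUBLAS_OP_T, m, n,
                &alpha, d_A, m,            // A is m x n with leading dimension m
                d_ones, 1,
                &beta, d_col_means, 1);
}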