Although I understand the logic behind parallel reduction described in this paper, I can't seem to be able to run it for a simple example where the input array has size
1s.
Here is what I achieved so far. Keep in mind that I'm using the thrust library to manage input and output data.
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <ctime>
#include <sys/time.h>
#include <sstream>
#include <string>
#include <fstream>
using namespace std;
__global__ void reduce0(int *g_idata, int *g_odata){
extern __shared__ int sdata[];
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i];
__syncthreads();
for(unsigned int s=1; s < blockDim.x; s *= 2) {
if (tid % (2*s) == 0) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
int main(void){
int size = 10;
thrust::host_vector<int> data_h_i(size, 1);
//initialize the data, all values will be 1
//so the final sum will be equal to 10
int threadsPerBlock = 256;
int totalBlocks = size/threadsPerBlock + 1;
dim3 dimGrid(totalBlocks,1,1);
dim3 dimBlock(threadsPerBlock, 1, 1);
thrust::device_vector<int> data_v_i = data_h_i;
thrust::device_vector<int> data_v_o(size);
int* output = thrust::raw_pointer_cast(data_v_o.data());
int* input = thrust::raw_pointer_cast(data_v_i.data());
reduce0<<<dimGrid, dimBlock>>>(input, output);
data_v_i.clear();
data_v_i.shrink_to_fit();
thrust::host_vector<int> data_h_o = data_v_o;
data_v_o.clear();
data_v_o.shrink_to_fit();
cout<<data_h_o[0]<<endl;
return 0;
}
the code is simple, I create a host_vector
of size size
and initialize all the values to 1.
then I say that we need 256 threads per each block and find dynamically the amount of blocks needed for my example.
To keep things simple, I create an array of 10 values only, which means that we are going to require only one block. So one kernel invocation will be enough to produce the final result.
My questions are the following:
Question 1
After compiling the above example (nvcc -O3 reduction.cu -arch=sm_21
) and entering ./a.out
I get the following message:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): unspecified launch failure
I'm not sure what's going on here, but it seems to me that the error comes from the line
sdata[tid] = g_idata[i]
The kernel is an exact copy of the kernel described in the paper so I'm not sure what changes are required in order to fix this problem.
Question 2
If we fix the first problem, how could we make the above code work for arbitrary size of input array? If for example our size
is more than 256, then we would need at least two blocks, so each block will give an output that will then have to be combined with the outputs of other blocks. In the paper it says that we would need multiple invocations of the kernel, however I'm not sure how this can be done dynamically.
Thank you in advance
EDIT1: For Question 1 it seems that I don't allocate memory for the shared memory correctly. Calling the kernel like that: reduce0<<<dimGrid, dimBlock, size*sizeof(int)>>>(input, output);
and also checking to see if tid
is not out of range.
makes the code work properly.
The new kernel is the following:
__global__ void reduce0(int *g_idata, int *g_odata, int size){
extern __shared__ int sdata[];
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
if(tid<size){
sdata[tid] = g_idata[i];
__syncthreads();
for(unsigned int s=1; s < size; s *= 2) {
if (tid % (2*s) == 0) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
}
I'm still not sure about Question 2 though.