0 votes

Each thread in a block can produce a different set (and size) of results. At the moment I am allocating a fixed amount of device memory per thread.

Meaning, for XX threads I have to allocate XX * max_result_count * data_structure * sizeof(int); my data consists of integers. Each thread accesses its memory block (offset) by calculating int i = blockDim.x * blockIdx.x + threadIdx.x; and multiplying it by max_result_count * data_structure for the integer array.

In the real world this means a huge waste of device memory, because some result sets are close to empty while others are not. For example, I have to allocate nearly 2 GB of device memory to store the equivalent of 300 MB of results.

Any ideas on how to rework this?

For example, each thread could lock a mutex, increment the actual res_count, write its data into the shared memory block, and unlock the mutex.

[Problem solved, thanks, guys!]


2 Answers

1 vote

You've already hinted in your question at one possible approach:

#define DSIZE (100*1048576)

__device__ unsigned int buffer_index = 0;
__device__ int *buffer_data;

In your host code:

int *buffer_data_temp;
cudaMalloc(&buffer_data_temp, sizeof(int)*DSIZE); 
cudaMemcpyToSymbol(buffer_data, &buffer_data_temp, sizeof(int *));

In your thread code:

unsigned int my_buffer_offset = atomicAdd(&buffer_index, size_of_my_thread_data);
assert((my_buffer_offset+size_of_my_thread_data) <= DSIZE);
memcpy(buffer_data+my_buffer_offset, my_thread_data, size_of_my_thread_data*sizeof(int));

(disclaimer: coded in browser, not tested)

It's not necessary to use a mutex, for example around the memcpy operation. Once the starting and ending points of a thread's allocation have been reserved with the atomicAdd, threads will not step on each other even if all are writing data, because each writes to a separate region within buffer_data.

EDIT: Here's a complete example:

#include <stdio.h>
#include <assert.h>
#define DSIZE (100*1048576)
#define nTPB 32
#define BLKS 2

__device__ unsigned int buffer_index = 0;

__global__ void update_buffer(int *buffer_data){
  const unsigned int size_of_my_thread_data = 1;
  unsigned int my_buffer_offset = atomicAdd(&buffer_index, size_of_my_thread_data);
  assert((my_buffer_offset+size_of_my_thread_data) <= DSIZE);
  int my_thread_data[size_of_my_thread_data];
  my_thread_data[0] = (blockIdx.x*10000) + threadIdx.x;
  memcpy(buffer_data+my_buffer_offset, my_thread_data, size_of_my_thread_data*sizeof(int));
}

int main(){

  int *h_buffer_data, *d_buffer_data;
  cudaMalloc(&d_buffer_data, sizeof(int)*DSIZE);
  update_buffer<<<BLKS, nTPB>>>(d_buffer_data);
  unsigned int result_size;
  cudaMemcpyFromSymbol(&result_size, buffer_index, sizeof(unsigned int));
  h_buffer_data = (int *)malloc(sizeof(int)*result_size);
  cudaMemcpy(h_buffer_data, d_buffer_data, result_size*sizeof(int), cudaMemcpyDeviceToHost);
  for (unsigned int i = 0; i < result_size; i++)
    printf("%d\n", h_buffer_data[i]);
  free(h_buffer_data);
  cudaFree(d_buffer_data);
  return 0;
}
1 vote

Rewrite the kernel and the calling function so that each kernel launch calculates only a part of the required points (obviously, you'll have to change the number of blocks per launch accordingly):

int offset = 0;
for(int i = 0; i < numKernelLaunches; i++) {
    yourKernel<<<numBlocks,threadsPerBlock>>>(offset, /* your other parameters */);
    offset += numBlocks*threadsPerBlock;
    cudaDeviceSynchronize();
}

and in yourKernel you keep int i = blockDim.x * blockIdx.x + threadIdx.x; as the index for global memory access, and use i + offset as the id of your data position.