
I'm trying to do out-of-core processing between GPU memory and CPU memory. For example, I have 1000 blocks of data, each 1 GB, and I need to process them in order, one kernel launch per block. Assume the processing must be done one block at a time, because every kernel launch except the first needs the result produced by the previous kernel, which is stored in the previous block. So I'm thinking of using a circular buffer on the GPU to hold the most recent 5 blocks, and using events to synchronize between a data stream and a task stream: the data stream transfers the data and the task stream launches the kernels. The code is illustrated below.

const int N_CBUF = 5, N_TASK = 1000;

// Each pointer points to a data block of 1GB
float* d_cir_buf[N_CBUF];
float* h_data_blocks[N_TASK];

// The data stream for transferring data from host to device.
// The task stream for launching kernels to process the data.
cudaStream_t s_data, s_task;

// The data events for the completion of each data transfer.
// The task events for the completion of each kernel execution.
cudaEvent_t e_data[N_TASK], e_task[N_TASK];

// ... code for creating the streams and events.

for (int i = 0; i < N_TASK; i++) {
  // The transfer must not overwrite a block that a kernel still needs.
  // Kernel (i-N_CBUF+1) is the last one to read the slot about to be
  // reused (it reads block i-N_CBUF as its "previous" input), so wait
  // for it before overwriting slot i % N_CBUF.
  if (i >= N_CBUF) {
    cudaStreamWaitEvent(s_data, e_task[i-N_CBUF+1], 0);
  }
  cudaMemcpyAsync(d_cir_buf[i % N_CBUF], h_data_blocks[i], ..., cudaMemcpyHostToDevice, s_data);
  cudaEventRecord(e_data[i], s_data);

  // The kernel must not start until block i has arrived on the device.
  cudaStreamWaitEvent(s_task, e_data[i], 0);

  // Pass the current and the previous data block to the kernel.
  my_kernel<<<..., s_task>>>(d_cir_buf[i % N_CBUF],
    i == 0 ? nullptr : d_cir_buf[(i+N_CBUF-1) % N_CBUF]);
  cudaEventRecord(e_task[i], s_task);
}

I'm wondering whether this is even a valid idea, or whether there is anything fundamentally wrong with it. Also, the CUDA programming guide mentions that if two memcpys from different host addresses target the same device address, they will not execute concurrently. Does this matter in my case? In particular, if the memory for d_cir_buf is allocated as one big block and then split into 5 pieces, would that count as "the same device address" and prevent concurrency? Finally, the (n+5)'th data transfer goes to the same device address as the n'th transfer, but given the synchronization above, no two such transfers can ever execute at the same time. Is that OK?
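For concreteness, here is how the two allocation variants I have in mind would look (BLOCK_FLOATS is a placeholder for the number of floats in one 1 GB block):

// Variant 1: five separate allocations.
for (int j = 0; j < N_CBUF; j++) {
  cudaMalloc(&d_cir_buf[j], BLOCK_FLOATS * sizeof(float));
}

// Variant 2: one big allocation, split into five pieces.
float* d_pool;
cudaMalloc(&d_pool, (size_t)N_CBUF * BLOCK_FLOATS * sizeof(float));
for (int j = 0; j < N_CBUF; j++) {
  d_cir_buf[j] = d_pool + (size_t)j * BLOCK_FLOATS;
}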


1 Answer


I have the feeling that your problem is best suited to double buffering:

  • two streams
  • upload data1 in stream1
  • run kernel on data1 in stream1
  • upload data2 in stream2
  • run kernel on data2 in stream2

... And so on

The kernel in stream2 can overlap with the data transfer in stream1, and vice versa.
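A minimal sketch of that pattern, assuming for simplicity that each kernel consumes only its own block, as in the list above (grid, block, and BLOCK_BYTES are placeholders):

cudaStream_t s[2];
float* d_buf[2]; // one device buffer per stream

cudaStreamCreate(&s[0]);
cudaStreamCreate(&s[1]);
// ... allocate d_buf[0] and d_buf[1]; the host blocks must be in
// pinned memory (cudaMallocHost) for cudaMemcpyAsync to overlap.

for (int i = 0; i < N_TASK; i++) {
  int b = i % 2;
  // Stream order already makes this copy wait for the kernel
  // previously issued in s[b], the last reader of d_buf[b].
  cudaMemcpyAsync(d_buf[b], h_data_blocks[i], BLOCK_BYTES,
                  cudaMemcpyHostToDevice, s[b]);
  my_kernel<<<grid, block, 0, s[b]>>>(d_buf[b]);
}
cudaDeviceSynchronize();

If kernel i must also read kernel (i-1)'s block, as in your question, stream s[b] would additionally have to wait on an event recorded after kernel i-1 before issuing the copy, since with only two buffers that kernel still reads d_buf[b].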