0 votes

From what I understand, when we transfer pageable memory from host to device, cudaMemcpy will automatically copy the data into a pinned staging buffer and then transfer it to the device.

Many code-optimization suggestions involve using pinned memory instead of pageable memory. I don't understand how it would be faster. The transfer itself will be faster, since it goes directly from pinned memory rather than requiring a staging copy first, but you still have to copy content from pageable memory to pinned memory yourself, which creates a lot of overhead. Am I misunderstanding the situation? Could someone explain why using pinned memory would be faster, given the overhead of that copy and the fact that it sounds like we are just doing manually what cudaMemcpy can do automatically?


1 Answer

1 vote

Pinned memory is required if you want to overlap copy and compute.
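To illustrate that point, here is a minimal standalone sketch (the kernel, sizes, and names here are just illustrative, not taken from your code): a cudaMemcpyAsync issued into one stream can run concurrently with a kernel in another stream, but only if the host buffer is pinned; with a pageable buffer the copy will not overlap.

#include <cstdio>
#include <cstring>

__global__ void busy_kernel(float *d_out, int n){
  int i = threadIdx.x + blockDim.x*blockIdx.x;
  if (i < n)
    for (int k = 0; k < 1000; k++) d_out[i] += 1.0f;   // just keep the GPU busy for a while
}

int main(){
  const int n = 1<<20;
  const size_t bytes = n*sizeof(float);
  float *h_in, *d_in, *d_out;
  cudaHostAlloc(&h_in, bytes, cudaHostAllocDefault);    // pinned host buffer: allows a truly async copy
  cudaMalloc(&d_in, bytes);
  cudaMalloc(&d_out, bytes);
  cudaMemset(d_out, 0, bytes);
  memset(h_in, 0, bytes);

  cudaStream_t s_compute, s_copy;
  cudaStreamCreate(&s_compute);
  cudaStreamCreate(&s_copy);

  // kernel in one stream, host->device copy in the other; with pinned h_in these can overlap
  busy_kernel<<<(n+255)/256, 256, 0, s_compute>>>(d_out, n);
  cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, s_copy);
  cudaDeviceSynchronize();
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));

  cudaStreamDestroy(s_compute); cudaStreamDestroy(s_copy);
  cudaFreeHost(h_in); cudaFree(d_in); cudaFree(d_out);
  return 0;
}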

In some situations, pinned memory may also provide a performance benefit. This is often noticeable if we can reuse the buffers that are used to transfer data between host and device.

"you still have to copy content from pageable memory to pinned memory yourself, which creates a lot of overhead."

I don't think you have to transfer data from pageable memory to pinned memory in every conceivable case.

Based on what appears to be dialog on your cross-posting here, I'll provide the following worked example showing a comparison between pinned and non-pinned memory:

$ cat t113.cu
#include <stdio.h>
#include <stdlib.h>

typedef double my_T;
const int ds = 1024;
const int num_iter = 100;
const int block_dim = 16;

// C = A * B
// naive!!
template <typename T>
__global__ void mm(const T * __restrict__ A, const T * __restrict__ B, T * __restrict__ C, size_t d)
{
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  int idy = threadIdx.y+blockDim.y*blockIdx.y;

  if ((idx < d) && (idy < d)){
    T temp = 0;
    for (int i = 0; i < d; i++)
      temp += A[idy*d + i]*B[i*d + idx];
    C[idy*d + idx] = temp;
    }
}

int main(int argc, char *argv[]){

  int use_pinned = 0;
  if (argc > 1) use_pinned = atoi(argv[1]);
  if (use_pinned) printf("Using pinned memory\n");
  else printf("Using pageable memory\n");
  my_T *d_A, *d_B, *d_C, *h_A, *h_B, *h_C;
  int bs = ds*ds*sizeof(my_T);
  cudaMalloc(&d_A, bs);
  cudaMalloc(&d_B, bs);
  cudaMalloc(&d_C, bs);
  if (use_pinned){
    cudaHostAlloc(&h_A, bs, cudaHostAllocDefault);
    cudaHostAlloc(&h_B, bs, cudaHostAllocDefault);
    cudaHostAlloc(&h_C, bs, cudaHostAllocDefault);}
  else {
    h_A = (my_T *)malloc(bs);
    h_B = (my_T *)malloc(bs);
    h_C = (my_T *)malloc(bs);}
  cudaMemset(d_A, 0, bs);
  cudaMemset(d_B, 0, bs);
  memset(h_C, 0, bs);
  dim3 block(block_dim,block_dim);
  dim3 grid((ds+block.x-1)/block.x, (ds+block.y-1)/block.y);
  for (int iter = 0; iter<num_iter; iter++){
    mm<<<grid, block>>>(d_A, d_B, d_C, ds);  // kernel launch is asynchronous; host code continues immediately
    // check the result copied back on the previous iteration (computed from data set two iterations ago)
    if (iter > 1) if (h_C[0] != (my_T)((iter-2)*(iter-2)*ds)) printf("validation failure at iteration %d, was %f, should be %f\n", iter, h_C[0], (my_T) ((iter-2)*(iter-2)*ds));
    // update the host input data while the kernel is still running on the device
    for (int i = 0; i < ds*ds; i++) {h_A[i] = iter; h_B[i] = iter;}
    cudaMemcpy(h_C, d_C, bs, cudaMemcpyDeviceToHost);  // blocks until the kernel has finished
    cudaMemcpy(d_A, h_A, bs, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bs, cudaMemcpyHostToDevice);}
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
}
$ nvcc -arch=sm_60 -o t113 t113.cu
$ time ./t113
Using pageable memory
no error

real    0m1.987s
user    0m1.414s
sys     0m0.571s
$ time ./t113 1
Using pinned memory
no error

real    0m1.487s
user    0m0.903s
sys     0m0.579s
$

CUDA 9.1, CentOS 7.4, Tesla P100

Briefly, this code performs 100 "naive" matrix-multiply operations on the GPU. At each iteration, we launch the matrix multiply on the GPU, and while it is running we update the host (input) data. When the matrix multiply is complete, we transfer the results to the host, then transfer the new input data to the device, then perform another iteration.

I'm not suggesting that this code is perfectly optimized. The kernel, for example, is a naive implementation (if you wanted a fast matrix multiply, you should use CUBLAS). And if you were serious about optimization, you would probably want to overlap the data transfers in this example with device code execution, in which case you would be forced to use pinned buffers anyway. But it's not possible to overlap copy and compute in every application, and in certain cases (such as the provided example) using pinned buffers helps performance on its own.

If you insist on comparing to a case where you must first copy data from a non-pinned buffer to a pinned buffer, then there may be no benefit. But without a concrete example of what you have in mind, it's not obvious to me that you can't do all of your work using only pinned host buffers (for the data you intend to send to/from the GPU). If you are reading data in from disk or over the network, you can read it into pinned buffers. If you are doing some host calculations first, you can do them in pinned buffers. Then send the data in those pinned buffers to the GPU.
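For example, here is a sketch of reading file data straight into a pinned buffer (the file name and data size are made up for illustration):

#include <cstdio>

int main(){
  const size_t bytes = 1024*1024*sizeof(double);          // assumed size of the data set
  double *h_buf, *d_buf;
  cudaHostAlloc(&h_buf, bytes, cudaHostAllocDefault);     // pinned host buffer
  cudaMalloc(&d_buf, bytes);

  // read straight into the pinned buffer -- no separate pageable staging copy in user code
  FILE *f = fopen("input.bin", "rb");                     // hypothetical input file
  if (!f){ printf("could not open input.bin\n"); return 1; }
  size_t got = fread(h_buf, 1, bytes, f);
  fclose(f);

  cudaMemcpy(d_buf, h_buf, got, cudaMemcpyHostToDevice);  // transfer directly from pinned memory
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
  cudaFreeHost(h_buf); cudaFree(d_buf);
  return 0;
}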