Does cuda memcpy from host to host perform synchronization?

Question

If I call cudaMemcpy from host memory to host memory, will it first synchronize the device? Is there any difference between the cuda memcpy call and the ordinary C++ function memcpy? I know that in case I want to do a memcpy 2D between host to host, I have to use the cuda call, since there is no such function in C++. Is there any other ones?

Roger Dahl Roger Dahl · Accepted Answer · 2014-03-16T01:57:08

If I call cudaMemcpy from host memory to host memory, will it first synchronize the device?

I verified that cudaMemcpy() with cudaMemcpyHostToHost does synchronize with the following code:

#include <cuda.h>

#define check_cuda_call(ans) { _check((ans), __FILE__, __LINE__); }
inline void _check(cudaError_t code, char *file, int line)
{
  if (code != cudaSuccess) {
    fprintf(stderr,"CUDA Error: %s %s %d\n", cudaGetErrorString(code), file, line);
    exit(code);
  }
}

__device__ clock_t offset;

__global__ void clock_block(clock_t clock_count)
{
  clock_t start_clock = clock();
  clock_t clock_offset = 0;
  while (clock_offset < clock_count) {
    clock_offset = clock() - start_clock;
  }
  offset = clock_offset;
}

int main(int argc, char *argv[])
{
  int *A;
  check_cuda_call(cudaMallocHost(&A, 1 * sizeof(int)));
  int *B;
  check_cuda_call(cudaMallocHost(&B, 1 * sizeof(int)));

  clock_block<<<1,1>>>(1000 * 1000 * 1000);

  //check_cuda_call(cudaDeviceSynchronize());
  check_cuda_call(cudaMemcpy(&A, &B, 1 * sizeof(int), cudaMemcpyHostToHost));
}

With a blocking call after the kernel launch, the app waits for around 1 second on my card. Without a blocking call, it exits immediately.

Is there any difference between the cuda memcpy call and the ordinary C++ function memcpy?

Yes, the synchronization, which also causes the cudaMemcpy() with cudaMemcpyHostToHost to be able to return errors from previous async calls, makes it different from plain memcpy().

I know that in case I want to do a memcpy 2D between host to host, I have to use the cuda call, since there is no such function in C++. Is there any other ones?

You might be able to use cudaMemcpyAsync() with cudaMemcpyHostToHost to do copies on the host without blocking the CPU, but I haven't tested it.

Does cuda memcpy from host to host perform synchronization?

1 Answers