Is there a way to copy from device to host within the kernel?
Something like the following code:
__global__ void kernel(int n, double *devA, double *hostA) {
    double x = 1.0;
    do_computation();
    cudaMemcpy(hostA, &x, sizeof(double), cudaMemcpyDeviceToHost);
    do_computation();
    cudaMemcpy(hostA, devA, sizeof(double), cudaMemcpyDeviceToHost);
}
Is this possible? According to the CUDA documentation, cudaMemcpy is not callable from device code, right?
NOTE: I don't want to use pinned memory. It performs poorly for my case because I will be constantly checking the host variable (memory). Using pinned memory would therefore trigger page faults (at best, on post-Pascal GPUs), and that will definitely happen! If both the host and the device access the same location, the result is basically a ping-pong effect!