Pinned memory is required if you want to overlap copy and compute.
In some situations, pinned memory may also provide a performance benefit. This is often noticeable when the buffers used to transfer data between host and device can be reused.
You suggest that you still have to copy content from pageable memory to pinned memory yourself, which creates a lot of overhead. I don't think you have to transfer data from pageable memory to pinned memory in every conceivable case.
Based on what appears to be the dialog on your cross-posting here, I'll provide the following worked example comparing pinned and non-pinned (pageable) memory:
$ cat t113.cu
#include <stdio.h>
#include <stdlib.h>
typedef double my_T;
const int ds = 1024;
const int num_iter = 100;
const int block_dim = 16;
// C = A * B
// naive!!
template <typename T>
__global__ void mm(const T * __restrict__ A, const T * __restrict__ B, T * __restrict__ C, size_t d)
{
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  int idy = threadIdx.y+blockDim.y*blockIdx.y;
  if ((idx < d) && (idy < d)){
    T temp = 0;
    for (int i = 0; i < d; i++)
      temp += A[idy*d + i]*B[i*d + idx];
    C[idy*d + idx] = temp;
  }
}

int main(int argc, char *argv[]){

  int use_pinned = 0;
  if (argc > 1) use_pinned = atoi(argv[1]);
  if (use_pinned) printf("Using pinned memory\n");
  else printf("Using pageable memory\n");
  my_T *d_A, *d_B, *d_C, *h_A, *h_B, *h_C;
  int bs = ds*ds*sizeof(my_T);
  cudaMalloc(&d_A, bs);
  cudaMalloc(&d_B, bs);
  cudaMalloc(&d_C, bs);
  // host buffers: pinned (cudaHostAlloc) or pageable (malloc), selected at run time
  if (use_pinned){
    cudaHostAlloc(&h_A, bs, cudaHostAllocDefault);
    cudaHostAlloc(&h_B, bs, cudaHostAllocDefault);
    cudaHostAlloc(&h_C, bs, cudaHostAllocDefault);}
  else {
    h_A = (my_T *)malloc(bs);
    h_B = (my_T *)malloc(bs);
    h_C = (my_T *)malloc(bs);}
  cudaMemset(d_A, 0, bs);
  cudaMemset(d_B, 0, bs);
  memset(h_C, 0, bs);
  dim3 block(block_dim,block_dim);
  dim3 grid((ds+block.x-1)/block.x, (ds+block.y-1)/block.y);
  for (int iter = 0; iter<num_iter; iter++){
    // launch the matrix multiply, then update the host inputs while it runs
    mm<<<grid, block>>>(d_A, d_B, d_C, ds);
    // the result observed on the host lags the inputs by 2 iterations
    if (iter > 1) if (h_C[0] != (my_T)((iter-2)*(iter-2)*ds)) printf("validation failure at iteration %d, was %f, should be %f\n", iter, h_C[0], (my_T) ((iter-2)*(iter-2)*ds));
    for (int i = 0; i < ds*ds; i++) {h_A[i] = iter; h_B[i] = iter;}
    // retrieve the previous result, then send the new inputs to the device
    cudaMemcpy(h_C, d_C, bs, cudaMemcpyDeviceToHost);
    cudaMemcpy(d_A, h_A, bs, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bs, cudaMemcpyHostToDevice);}
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
}
$ nvcc -arch=sm_60 -o t113 t113.cu
$ time ./t113
Using pageable memory
no error
real 0m1.987s
user 0m1.414s
sys 0m0.571s
$ time ./t113 1
Using pinned memory
no error
real 0m1.487s
user 0m0.903s
sys 0m0.579s
$
CUDA 9.1, CentOS 7.4, Tesla P100
Briefly, this code is doing 100 "naive" matrix-multiply operations on the GPU. At each iteration, we are launching the matrix-multiply on the GPU, and while that is being done we are updating the host (input) data. When the matrix multiply is complete, we transfer the results to the host, then transfer the new input data to the device, then perform another iteration.
I'm not suggesting that this code is perfectly optimized. The kernel, for example, is a naive implementation (if you want a fast matrix multiply, use CUBLAS). And if you were serious about optimization, you would probably want to overlap the data transfers in this example with device code execution, in which case you would be forced to use pinned buffers anyway. But it's not possible to achieve overlap of copy and compute in every application, and in certain cases (such as the provided example) using pinned buffers can still help performance.
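For reference, here is a minimal sketch of that overlap pattern, separate from the example above. The kernel name (process), the sizes, and the chunk count are placeholders I've chosen for illustration; the essential ingredients are the pinned allocation via cudaHostAlloc, cudaMemcpyAsync, and issuing work into streams:

#include <stdio.h>

__global__ void process(float *d, size_t n){
  size_t i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) d[i] = d[i] * 2.0f;   // stand-in for real work
}

int main(){
  const size_t N = 1<<24;          // placeholder total element count
  const int nchunks = 8;           // placeholder pipeline depth
  const size_t chunk = N/nchunks;
  float *h, *d;
  cudaHostAlloc(&h, N*sizeof(float), cudaHostAllocDefault); // pinned: needed for true async copies
  cudaMalloc(&d, N*sizeof(float));
  for (size_t i = 0; i < N; i++) h[i] = 1.0f;
  cudaStream_t s[2];
  cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);
  for (int c = 0; c < nchunks; c++){
    // alternate between two streams so one chunk's copies can overlap another chunk's kernel
    cudaStream_t st = s[c & 1];
    cudaMemcpyAsync(d + c*chunk, h + c*chunk, chunk*sizeof(float), cudaMemcpyHostToDevice, st);
    process<<<(chunk+255)/256, 256, 0, st>>>(d + c*chunk, chunk);
    cudaMemcpyAsync(h + c*chunk, d + c*chunk, chunk*sizeof(float), cudaMemcpyDeviceToHost, st);
  }
  cudaDeviceSynchronize();
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
  cudaFreeHost(h); cudaFree(d);
}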
If you insist on comparing against a case where you must first copy data from a non-pinned buffer to a pinned buffer, then there may be no benefit. But without a concrete example of what you have in mind, it's not obvious to me that you can't do all of your work using only pinned host buffers (for the data you intend to send to/from the GPU). If you are reading data in from disk or over the network, you could read it directly into pinned buffers. If you are doing some host calculations first, you could do them in pinned buffers, and then send that data to the GPU.
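As an illustration of that idea, here is a minimal sketch of reading a file directly into a pinned buffer, so no pageable-to-pinned staging copy ever happens. The file name (input.bin) and the element count are placeholders, not taken from your case:

#include <stdio.h>

int main(){
  const size_t n = 1<<20;                        // placeholder element count
  double *h_buf, *d_buf;
  cudaHostAlloc(&h_buf, n*sizeof(double), cudaHostAllocDefault);  // pinned from the start
  cudaMalloc(&d_buf, n*sizeof(double));
  FILE *f = fopen("input.bin", "rb");            // placeholder file name
  if (f){
    size_t got = fread(h_buf, sizeof(double), n, f);  // read straight into the pinned buffer
    fclose(f);
    cudaMemcpy(d_buf, h_buf, got*sizeof(double), cudaMemcpyHostToDevice);  // no staging copy
  }
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
  cudaFreeHost(h_buf); cudaFree(d_buf);
}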