According to PTX manual here is how prefetch works in PTX:
You can embed the PTX instructions into the CUDA kernel. Here is a tiny sample from NVIDIA's documentation:
__device__ int cube (int x)
{
int y;
asm("{\n\t" // use braces for local scope
" .reg .u32 t1;\n\t" // temp reg t1,
" mul.lo.u32 t1, %1, %1;\n\t" // t1 = x * x
" mul.lo.u32 %0, t1, %1;\n\t" // y = t1 * x
"}"
: "=r"(y) : "r" (x));
return y;
}
You may come to conclude with the following prefetch function in C:
__device__ void prefetch_l1 (unsigned int addr)
{
asm(" prefetch.global.L1 [ %1 ];": "=r"(addr) : "r"(addr));
}
NOTICE: You need the GPU of Compute Capability 2.0 or higher for prefetch. Pass proper compile flags accordingly -arch=sm_20