1 vote

I'm working on data prefetching in NVIDIA CUDA. I have read some documents on prefetching on the device itself, i.e., prefetching from shared memory to cache.

But I'm interested in data prefetching between the CPU and the GPU. Can anyone point me to some documents or other resources on this topic? Any help would be appreciated.

Your question is way too broad in its current form - try asking a more specific question. You might also want to check out the NVIDIA developer forums at developer.nvidia.com. - Paul R
OK, how can I add a prefetch instruction to a given CUDA program? - username_4567
This is still very vague - prefetch what to what, exactly? For what purpose? On what generation of GPU? - Paul R

3 Answers

1 vote

Answer based on your comment:

when we want to perform computation on large data, ideally we'll send the maximum amount of data to the GPU, perform the computation, and send it back to the CPU, i.e., SEND, COMPUTE, SEND (back to CPU). Now, when the data is sent back to the CPU, the GPU has to stall. My plan is: given a CUDA program that would normally use the entire global memory, I'll compel it to run in half of the global memory, so that I can use the other half for data prefetching. While computation is being performed in one half, I can simultaneously prefetch data into the other half, so there will be no stalls. Now tell me: is this feasible? Will performance be degraded or improved? It should improve.

CUDA streams were introduced to enable exactly this approach.

If your computation is rather intensive, then yes - it can greatly improve your performance. On the other hand, if data transfers take, say, 90% of your time, you will save only on the computation time, that is, 10% at most.

The details, including examples, of how to use streams are provided in the CUDA Programming Guide. For version 4.0, that is section "3.2.5.5 Streams", and in particular "3.2.5.5.5 Overlapping Behavior", where they launch another, asynchronous memory copy while a kernel is still running.
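A minimal sketch of the two-stream idea described above (assumptions: a kernel named `process`, device buffers `d_in0`/`d_out0`/`d_in1`/`d_out1`, and page-locked host buffers `h_in0`/`h_out0`/`h_in1`/`h_out1` already allocated; all of these names are placeholders, and error checking is omitted):

```cuda
cudaStream_t stream0, stream1;
cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);

// Issue copy -> kernel -> copy-back for chunk 0 in stream0.
cudaMemcpyAsync(d_in0, h_in0, bytes, cudaMemcpyHostToDevice, stream0);
process<<<blocks, threads, 0, stream0>>>(d_in0, d_out0);
cudaMemcpyAsync(h_out0, d_out0, bytes, cudaMemcpyDeviceToHost, stream0);

// Issue the same sequence for chunk 1 in stream1. While the kernel for
// chunk 0 is still running, the host-to-device copy of chunk 1 can
// proceed concurrently (on devices with a copy engine).
cudaMemcpyAsync(d_in1, h_in1, bytes, cudaMemcpyHostToDevice, stream1);
process<<<blocks, threads, 0, stream1>>>(d_in1, d_out1);
cudaMemcpyAsync(h_out1, d_out1, bytes, cudaMemcpyDeviceToHost, stream1);

cudaDeviceSynchronize();  // wait for both pipelines to drain
```

Note that the host buffers must be allocated with `cudaHostAlloc` (or registered with `cudaHostRegister`) for `cudaMemcpyAsync` to actually overlap with kernel execution.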

0 votes

Perhaps you would be interested in the asynchronous host/device memory transfer capabilities of CUDA 4.0? You can overlap host/device memory transfers and kernels by using page-locked host memory. You could use this to...

  1. Copy working sets #1 & #2 from host to device.
  2. Process chunk #i, transfer chunk #i+1 to the device, and load chunk #i+2 on the host, all concurrently.

So you could be streaming data in and out of the GPU and computing on it all at once (!). Please refer to the CUDA 4.0 Programming Guide and CUDA 4.0 Best Practices Guide for more detailed information. Good luck!
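A rough double-buffered version of that pipeline might look like the following (assumptions: a placeholder kernel `process`, a host-side staging function `fill_chunk`, and arrays `h_buf`/`d_buf`/`stream` of size 2, all hypothetical names; error checking omitted):

```cuda
// Page-locked (pinned) host buffers are required for async overlap.
cudaHostAlloc(&h_buf[0], bytes, cudaHostAllocDefault);
cudaHostAlloc(&h_buf[1], bytes, cudaHostAllocDefault);

for (int i = 0; i < n_chunks; ++i) {
    int s = i % 2;                       // alternate buffer/stream pairs

    // Wait until the previous use of this buffer (chunk i-2) finished
    // before overwriting it on the host.
    cudaStreamSynchronize(stream[s]);

    fill_chunk(h_buf[s], i);             // stage chunk i on the host
    cudaMemcpyAsync(d_buf[s], h_buf[s], bytes,
                    cudaMemcpyHostToDevice, stream[s]);
    process<<<blocks, threads, 0, stream[s]>>>(d_buf[s]);
    cudaMemcpyAsync(h_buf[s], d_buf[s], bytes,
                    cudaMemcpyDeviceToHost, stream[s]);
}
cudaDeviceSynchronize();
```

While the kernel for chunk `i` runs in one stream, the transfers for chunk `i+1` are in flight in the other, which is exactly the SEND/COMPUTE overlap the question asks about.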

0 votes

CUDA 6 will eliminate the need to copy explicitly, i.e. the copying will be automatic (unified memory). However, you may still benefit from prefetching.

In a nutshell, you want the data for the "next" computation to be transferring while you complete the current computation. To achieve that you need at least two threads on the CPU, and some kind of signalling scheme (to know when to send the next data). The chunk size will of course play a big role and affect performance.
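A sketch of the two-thread scheme described above, using `std::thread` for the producer and stream synchronization as the signal (assumptions: pinned host buffers `h_buf[2]`, device buffers `d_buf[2]`, a placeholder kernel `process`, and a host-side `stage_chunk` function; chunk 0 is assumed to be on the device before the loop starts; all names are hypothetical):

```cuda
#include <thread>

void run_pipeline(int n_chunks) {
    cudaStream_t copy_stream, compute_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);

    for (int i = 0; i < n_chunks; ++i) {
        int cur = i % 2, next = (i + 1) % 2;

        // Second CPU thread stages and uploads chunk i+1
        // while chunk i is being computed on the GPU.
        std::thread producer([&] {
            if (i + 1 < n_chunks) {
                stage_chunk(h_buf[next], i + 1);
                cudaMemcpyAsync(d_buf[next], h_buf[next], bytes,
                                cudaMemcpyHostToDevice, copy_stream);
            }
        });

        process<<<blocks, threads, 0, compute_stream>>>(d_buf[cur]);

        producer.join();                     // signal: next chunk staged
        cudaStreamSynchronize(copy_stream);  // its transfer has landed
        cudaStreamSynchronize(compute_stream);
    }
}
```

`producer.join()` plus the stream synchronization plays the role of the signalling scheme mentioned above; a real implementation would likely use events or a condition variable instead of joining a fresh thread every iteration.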

The above may be easier on an APU (CPU+GPU on the same die), as the need to copy is eliminated: both processors can access the same memory.

If you want to find papers on GPU prefetching, just use Google Scholar.