Answer based on your comment:
when we to want perform computation on large data ideally we'll send max data to GPU,perform computation,send it back to CPU i.e SEND,COMPUTE,SEND(back to CPU) now whn it sends back to CPU GPU has to stall,now my plan is given CU program,say it runs in entire global mem,i'll compel it to run it in half of the global mem so that rest of the half i can use for data prefetching,so while computation is being performed in one half simultaneously i cn prefetch data in otherhalf.so no stalls will be there..now tell me is it feasible to do?performance will be degraded or upgraded?should enhance..
CUDA streams were introduced to enable exactly this approach.
If your compoutation is rather intensive, then yes --- it can greatly speed up your performance. On the other hand, if data transfers take, say, 90% of your time, you will save only on computation time - that is - 10% tops...
The details, including examples, on how to use streams is provided in CUDA Programming Guide.
For version 4.0, that will be section "3.2.5.5 Streams", and in particular "3.2.5.5.5 Overlapping Behavior" --- there, they launch another, asynchronous memory copy, while a kernel is still running.