4 votes

The documentation states that the prefetch and prefetchu PTX instructions "prefetch line containing a generic address at a specified level of memory hierarchy, in specified state space". The syntax is given as:

prefetch{.space}.level [a]; // prefetch to data cache
prefetchu.L1 [a]; // prefetch to uniform cache

.space = { .global, .local };
.level = { .L1, .L2 };

I would like to know what uniform cache is being referred to here, given that the syntax (in the second line) says the data is going to be prefetched into L1. Isn't prefetchu redundant when the prefetch instruction already allows prefetching to L1? For example, what is the difference between the two lines of code below?

prefetch.global.L1  [a];  // a maps to global memory.
prefetchu.L1  [a];  // a maps to global memory.
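
(For concreteness, here is a minimal sketch of how both instructions can be emitted from CUDA C++ via inline PTX; the kernel and variable names are made up, and the "l" constraint assumes 64-bit pointers.)

__global__ void prefetch_demo(const float *p, float *out)
{
    // prefetch the line containing p into the L1 data cache
    asm volatile ("prefetch.global.L1 [%0];" :: "l"(p));
    // prefetch the line containing p into the "uniform" cache
    asm volatile ("prefetchu.L1 [%0];" :: "l"(p));
    *out = *p;  // eventual use of the data
}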
I am not sure, but I think "uniform cache" refers to the "constant cache", which has a broadcast feature: it allows the same data to be broadcast to all threads in a warp, provided the access is uniform, i.e. all threads in the warp access the same address. While on older architectures the constant cache was separate from the regular L1, I believe it has been absorbed into the generic read-only cache on Maxwell. Again, I am not sure about that. Why are the details of these prefetch instructions important to your use case? What are you hoping to accomplish? - njuffa
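
(To illustrate the uniform-access condition njuffa describes, here is a hypothetical kernel; coeff and broadcast_demo are made-up names.)

__constant__ float coeff[16];   // hypothetical constant array

__global__ void broadcast_demo(float *out, const float *in, int k)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Uniform access: every thread in the warp reads coeff[k], the same
    // address, so the constant cache can broadcast one value to the warp.
    out[tid] = in[tid] * coeff[k];
    // A non-uniform access such as coeff[tid % 16] would instead serialize,
    // since threads in the warp hit different constant addresses.
}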
@njuffa Inside my program, a few threads inside the warp make non-coalesced global reads at a certain point. There are independent instructions that can be scheduled after these reads without having to wait for the read content. So I'm thinking I can issue a prefetch right after the address is discovered and then schedule my independent operations. When the content of that address is needed, hopefully it can be found in the cache. Basically I'm trying to hide the memory access latency. - Farzad
@njuffa While I'm guessing the NVCC compiler already performs such an optimization, I think the designer of the program might be able to reason better about where it is best to fetch such data. For example, on a Kepler device, if the program doesn't use shared memory and there's no register spilling, it's probably better to prefetch into L1 if only a few threads inside the SM are making such accesses. - Farzad
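
(A minimal sketch of the pattern described above; the kernel name, the indirection through idx, and the filler loop are all illustrative, and whether the prefetch actually helps is architecture-dependent.)

__global__ void hide_latency(const float *gbl, const int *idx,
                             float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // The address becomes known here; issue the prefetch immediately.
    const float *p = gbl + idx[tid];   // non-coalesced, data-dependent address
    asm volatile ("prefetch.global.L1 [%0];" :: "l"(p));

    // Independent work that does not need *p can be scheduled here,
    // overlapping with the memory access started by the prefetch.
    float acc = 0.0f;
    for (int i = 0; i < 32; ++i)
        acc += i * 0.5f;

    // By the time *p is actually consumed, it may already be in the cache.
    out[tid] = acc + *p;
}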
Just agreeing with @njuffa, I think prefetchu uses the same mechanism as LDU. Not sure it has any meaning on a non-cc2.x device. I suspect prefetch in general could be interpreted by the ptxas compiler in more than one way. Inspecting the SASS that emanates from these instructions (if any) might be instructive. - Robert Crovella
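
(One way to do that inspection, assuming the kernel lives in a hypothetical file prefetch_demo.cu: compile to a cubin and disassemble it with cuobjdump, adjusting -arch for your GPU.

nvcc -arch=sm_35 -cubin prefetch_demo.cu -o prefetch_demo.cubin
cuobjdump -sass prefetch_demo.cubin

Then look for prefetch-related opcodes in the output to see whether ptxas emitted anything for these instructions.)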

1 Answer

2 votes

Uniform cache is indeed the constant cache, as noted in the book "CUDA Application Design and Development":

..."the SM also contains constant (labaled Uniform cache " (sic)...

https://books.google.com.tr/books?id=Y-XmJO2uwvMC&pg=PA112&lpg=PA112#v=onepage&q&f=false