
While optimizing and profiling a kernel, I noticed that its L2 and global cache hit rates were very low (~1.2 % avg.). My kernel typically reads 4 full cache lines per pass per warp, with 3 blocks per SM (so 4 * 32 * 2 = 256 cache lines per SM per pass of my kernel, which runs a variable number of passes). The reads come from different regions of global memory, which obviously makes them hard to cache. (The pattern of the regions is A, 32 * B, A, ...)

This makes it clear that for data that is so "dispersed" and read only once before moving on, the L1/L2 cache is almost useless. To compensate for how widely my kernel's reads are spread, I am considering using texture memory, which is "pre-cached" in L1.

Can this be considered "good" practice?

Side question 1: If the accesses to that texture are coalesced (assuming row-major layout), does that still give a performance gain over non-coalesced texture reads?

Side question 2: Since my data is read so that each warp reads 1 row, is a 2D texture really that useful? Or is a 1D layered texture better for the job?

Sorry if the side questions are already answered elsewhere; they crossed my mind while I was writing, and a quick search (probably using the wrong vocabulary) did not yield an answer. Sorry if my question is dumb; my reading about CUDA is limited to the NVIDIA documentation.

So you read your data only once, and this read is coalesced and cache-line aligned? - dari
Yes. It's exactly that, and that was checked by the profiler - Sachiko.Shinozaki
The primary benefit of caching is for (explicit) data reuse, i.e. temporal locality (on reuse). The secondary benefit is for spatial locality, effectively trying to use the cache as a prefetch mechanism. I wouldn't expect a huge benefit from this unless you actually have data reuse. There are probably some smaller gains to be had from the prefetching in some cases, but whether this alone provides any benefit will be highly dependent on the code and access pattern. - Robert Crovella
I am actually developing my texture-based implementation right now. On some of the variables, I went from a 1.2 % to a 50 % cache hit rate (the implementation is still slightly buggy, so maybe I'm just not fetching the same data). While I don't have cache reuse, my patterns for each warp are crystal clear, which I believe makes them a good fit for texture memory? Anyway, I will probably post an answer here to document that when I finish the implementation. - Sachiko.Shinozaki

1 Answer


In the end, the texture-based implementation did not bring much. From what I understand, while the cache hit rate went up (~50 %), there definitely is an overhead in the cache hierarchy or the texture units.

Takeaway (not application specific):

Texture memory comes with a slight overhead, which makes it worthwhile only in situations where the filtering it provides is a benefit AND the whole texture can fit in the caches, giving perfectly cached 2D memory that is resistant to non-coalesced accesses.
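
For anyone who wants to try the same experiment, here is a minimal sketch of routing reads through the texture path with a texture object bound to plain linear memory. It is not my actual kernel: the names `copyThroughTexture` and `countMismatchesViaTexture` and the sizes are made up for illustration, error checking is left out for brevity, and it assumes a device that supports texture objects (compute capability 3.0 or later). Each thread reads one element exactly once, which is roughly the "read once, no reuse" situation from the question.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread fetches one element through the texture
// path (unfiltered, integer-indexed) and writes it to global memory.
__global__ void copyThroughTexture(cudaTextureObject_t tex, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = tex1Dfetch<float>(tex, i);
    }
}

// Hypothetical helper: copies n floats through a texture object and returns
// how many elements came back different from what was uploaded.
int countMismatchesViaTexture(int n)
{
    std::vector<float> host(n);
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Describe the linear device buffer that the texture object will read.
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = dIn;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    // Return raw element values; no filtering or normalization.
    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    copyThroughTexture<<<blocks, threads>>>(tex, dOut, n);

    std::vector<float> result(n);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(result.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (result[i] != host[i]) ++mismatches;
    }

    cudaDestroyTextureObject(tex);
    cudaFree(dIn);
    cudaFree(dOut);
    return mismatches;
}

int main()
{
    int mismatches = countMismatchesViaTexture(1 << 20);
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Profiling a kernel like this against the same copy done with ordinary global loads is one way to see whether the texture path pays off for a given access pattern. In my case the hit rate went up but the run time did not improve much.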