While optimizing and profiling a kernel, I noticed that its L2 and global cache hit rates were very low (~1.2% on average). My kernel typically reads 4 full cache lines per pass per warp, with 3 blocks per SM (so 4 * 32 * 2 = 256 cache lines per SM per pass of my kernel, which runs a variable number of passes). The reads come from different regions of global memory, which obviously makes them hard to cache. (The pattern of the regions is A, 32 * B, A, ...)
It seems clear, then, that for data that is this "dispersed" and read only once before moving on, the L1/L2 cache is almost useless. To compensate for how widely spread my kernel's reads are, I am considering using texture memory, which is "pre-cached" in L1.
Can using texture memory this way be considered "good" practice?
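To make the idea concrete, here is a minimal sketch of the kind of access I have in mind. It is not my actual kernel: the names `sumRows` and `runRowSums`, the row length, and the row count are made up for illustration, and I am assuming the data sits in linear device memory wrapped in a texture object and read with `tex1Dfetch`. Each warp reads one row, with consecutive lanes fetching consecutive elements.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each warp sums one row of `rowLen` floats,
// fetching every element through the texture path.
// Assumes blockDim.x is a multiple of 32 so a warp never straddles two rows.
__global__ void sumRows(cudaTextureObject_t tex, float* out, int rowLen, int numRows)
{
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane   = threadIdx.x % 32;
    if (warpId >= numRows) return;  // uniform across the warp

    float acc = 0.0f;
    // Consecutive lanes fetch consecutive elements of the warp's row.
    for (int i = lane; i < rowLen; i += 32)
        acc += tex1Dfetch<float>(tex, warpId * rowLen + i);

    // Warp-level reduction of the per-lane partial sums.
    for (int offset = 16; offset > 0; offset >>= 1)
        acc += __shfl_down_sync(0xffffffffu, acc, offset);

    if (lane == 0) out[warpId] = acc;
}

// Hypothetical host helper: fills the input with 1.0f, so every row
// should sum to rowLen. Error checking is omitted to keep the sketch short.
std::vector<float> runRowSums(int rowLen, int numRows)
{
    const size_t n = static_cast<size_t>(rowLen) * numRows;
    std::vector<float> hIn(n, 1.0f), hOut(numRows, 0.0f);

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, numRows * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Describe the linear buffer as a texture resource.
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = dIn;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;  // return raw floats

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

    const int threads = 128;  // 4 warps per block
    const int blocks  = (numRows * 32 + threads - 1) / threads;
    sumRows<<<blocks, threads>>>(tex, dOut, rowLen, numRows);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(hOut.data(), dOut, numRows * sizeof(float), cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFree(dIn);
    cudaFree(dOut);
    return hOut;
}

int main()
{
    const int rowLen = 128, numRows = 64;
    std::vector<float> sums = runRowSums(rowLen, numRows);

    int mismatches = 0;
    for (float s : sums)
        if (s != static_cast<float>(rowLen)) ++mismatches;
    std::printf("rows: %d, mismatches: %d\n", numRows, mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

I used a plain linear texture here only because it is the shortest thing to write; whether a 2D or 1D layered texture would be a better fit is exactly what I am unsure about.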
Side question 1: If the accesses to that texture are coalesced (assuming a row-major layout), do they still perform better than non-coalesced texture reads?
Side question 2: Since my data is read such that each warp reads 1 row, is a 2D texture really that useful? Or is a 1D layered texture better suited to the job?
Sorry if the side questions are already answered elsewhere; they came to mind while I was writing, and a quick search (probably using the wrong vocabulary) did not yield an answer. Sorry if my question is dumb; my reading about CUDA is limited to the NVIDIA documentation.