I have a 2D host array with 10 rows and 96 columns. I copy this array to CUDA device global memory in row-major order, i.e. row 1, row 2, row 3, ..., row 10.
The array is of type float. In my kernel, each thread reads one float value from device global memory.
The BLOCK_SIZE I use is 96 (threads per block).
The GRID_DIM I use is 10 (blocks).
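To make the access pattern concrete, here is a minimal host-side sketch in plain C++ (not the actual kernel) of which byte each thread would read. It assumes that thread t of block b reads element b * 96 + t, which the question implies but does not state. All names (`elementIndex`, `byteOffset`, `warpOf`, the `k...` constants) are illustrative.

```cpp
#include <cstddef>

// Launch shape from the question.
constexpr std::size_t kCols = 96;      // BLOCK_SIZE: one thread per column
constexpr std::size_t kRows = 10;      // GRID_DIM: one block per row
constexpr std::size_t kWarpSize = 32;  // threads per warp on NVIDIA GPUs

static_assert(sizeof(float) == 4, "this sketch assumes 4-byte floats");

// Assumed mapping: thread `thread` of block `block` reads this element
// of the row-major array.
constexpr std::size_t elementIndex(std::size_t block, std::size_t thread) {
    return block * kCols + thread;
}

// Byte offset of that element from the start of the device allocation.
constexpr std::size_t byteOffset(std::size_t block, std::size_t thread) {
    return elementIndex(block, thread) * sizeof(float);
}

// Warp a thread belongs to within its block:
// threads 0..31 -> warp 0, 32..63 -> warp 1, 64..95 -> warp 2.
constexpr std::size_t warpOf(std::size_t thread) {
    return thread / kWarpSize;
}

// Total size of the array in bytes.
constexpr std::size_t totalBytes() {
    return kRows * kCols * sizeof(float);
}

// Compile-time checks of the arithmetic.
static_assert(byteOffset(0, 0) == 0, "first element sits at offset 0");
static_assert(byteOffset(1, 0) == 384, "each row is 96 * 4 = 384 bytes");
static_assert(warpOf(95) == 2, "a 96-thread block holds exactly three warps");
static_assert(totalBytes() == 3840, "10 rows * 384 bytes");
```

Under this assumed mapping, each block covers one 384-byte row and splits into three full warps of 32 threads, with each warp reading 32 consecutive floats (128 bytes).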
From what I understood of the "CUDA C Programming Guide" section on coalesced accesses, the pattern I am using is correct: the threads of a warp access consecutive memory locations. But there is also a clause about 128-byte memory alignment, which I fail to understand.
Q1) 128-byte memory alignment: does it mean that each thread in a warp should access 4 bytes within a range starting at an address such as 0x00 and running up to 0x80?
Q2) So in this scenario, am I making uncoalesced accesses or not?
My understanding is: each thread should make one 4-byte memory access within an address range such as 0x00 to 0x80. If a thread of the warp accesses a location outside that range, it's an uncoalesced access.
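One way to test this understanding is to count how many 128-byte-aligned segments a warp's reads fall into. The sketch below is a host-side model in plain C++, not a statement of what the hardware does on every architecture. It assumes a full warp of 32 threads reading consecutive floats, the mapping element = block * cols + thread, and a base address that is a multiple of 128 (the Programming Guide states that memory returned by the allocation routines is aligned to at least 256 bytes). The function names and the 100-column case are hypothetical and included only for contrast.

```cpp
#include <cstddef>

constexpr std::size_t kSegmentBytes = 128;  // alignment granularity in question
constexpr std::size_t kWarpSize = 32;       // threads per warp
constexpr std::size_t kElemBytes = 4;       // size of one float

// Offset (from the allocation base) of the first byte read by warp `warp`
// of block `block`, for rows that are `cols` floats wide.
constexpr std::size_t warpFirstByte(std::size_t cols, std::size_t block,
                                    std::size_t warp) {
    return (block * cols + warp * kWarpSize) * kElemBytes;
}

// Number of 128-byte-aligned segments covered by the warp's 32 consecutive
// 4-byte reads: index of last segment minus index of first segment, plus one.
constexpr std::size_t segmentsTouched(std::size_t cols, std::size_t block,
                                      std::size_t warp) {
    return (warpFirstByte(cols, block, warp) + kWarpSize * kElemBytes - 1)
               / kSegmentBytes
         - warpFirstByte(cols, block, warp) / kSegmentBytes
         + 1;
}

// Scenario from the question: 96 columns, so each row is 384 = 3 * 128 bytes
// and every warp starts on a 128-byte boundary.
static_assert(segmentsTouched(96, 0, 0) == 1, "block 0, warp 0");
static_assert(segmentsTouched(96, 1, 0) == 1, "block 1, warp 0 starts at 384");
static_assert(segmentsTouched(96, 9, 2) == 1, "last warp of the last block");

// Hypothetical contrast: 100 columns make row 1 start at byte 400, which is
// not a multiple of 128, so the same warp straddles two segments.
static_assert(segmentsTouched(100, 1, 0) == 2, "misaligned row start");
```

In this model the range a warp must stay inside is not fixed at 0x00 to 0x80: any 128-byte window starting at a multiple of 128 (0x80, 0x100, 0x180, ...) counts. With 96 columns every warp lands in exactly one such window. How many memory transactions the hardware actually issues also depends on the compute capability and cache configuration.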