2
votes

I have a 2D host array with 10 rows and 96 columns. I copy this array into my CUDA device's global memory linearly, i.e. row1, row2, row3 ... row10.

The array is of type float. In my kernel each thread accesses one float value from the device global memory.

 The BLOCK_SIZE I use is = 96
 The GRID_DIM I use is = 10

From what I understood of the "CUDA C Programming Guide" section on coalesced accesses, the pattern I am using is correct: the threads of a warp access consecutive memory locations. But there is also a clause about 128-byte memory alignment, which I fail to understand.

Q1) 128-byte memory alignment: does it mean that each thread in a warp should access 4 bytes within an address range such as 0x00 to 0x80?

Q2) So in this scenario, am I making uncoalesced accesses or not?

My understanding is: each thread should make one memory access, which should be 4 bytes, within an address range such as 0x00 to 0x80. If a thread in the warp accesses a location outside that range, it's an uncoalesced access.

1 Answer

9
votes

Loads from global memory are usually done in chunks of 128 bytes, aligned on 128 byte boundaries. Coalesced memory access means that you keep all accesses from your warp to one chunk of 128 bytes. (In older cards, the memory had to be accessed in order of thread id, but newer cards no longer have this requirement.)

If the 32 threads in your warp each read a float, you will read a total of 128 bytes from global memory. If the memory is aligned correctly, all reads will come from the same 128-byte chunk. If the alignment is off, you'll need two reads. If you do something like a[32*i], then each access will come from a different 128-byte chunk in global memory, which will be very slow.

It doesn't matter which chunk you access, as long as all threads in a warp access the same one. (Here "chunk" or "block" means a 128-byte region of memory, not a thread block.)

If you have an array of 96 floats, then if each thread with index i in your warp accesses a[i], it will be a coalesced read. Same with either a[i+32] or a[i+64].
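To make the counting concrete, here is a rough host-side sketch in plain C++ (no GPU needed) that models the rule described above: it counts how many distinct 128-byte chunks one warp of 32 threads touches for a given index pattern. The function name `segmentsTouched` and the constants are made up for illustration, and the model deliberately ignores caches and architecture-specific transaction sizes, so treat it as a way to reason about alignment rather than a performance predictor.

```cpp
#include <cstddef>
#include <set>

// Assumptions for this model (not queried from any device):
constexpr std::size_t kWarpSize = 32;      // threads per warp
constexpr std::size_t kSegmentBytes = 128; // size of one aligned chunk
constexpr std::size_t kFloatBytes = 4;     // size of one float element

// Counts the distinct 128-byte chunks touched by one warp.
// baseByteAddress: byte address of a[0] (hypothetical, chosen by the caller).
// index(lane):     which element of a[] the thread with that lane id reads.
template <typename IndexFn>
std::size_t segmentsTouched(std::size_t baseByteAddress, IndexFn index) {
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        const std::size_t byteAddress =
            baseByteAddress + index(lane) * kFloatBytes;
        // Integer division maps an address to the aligned chunk containing it.
        segments.insert(byteAddress / kSegmentBytes);
    }
    return segments.size();
}
```

For example, with an aligned base address of 0, the patterns `a[i]` and `a[i+32]` each touch a single chunk, while `a[32*i]` touches 32 different chunks. Shifting the base address by 4 bytes makes `a[i]` straddle two chunks, which is the "alignment is off" case.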

So, the answer to Q1 is that all threads in a warp need to stay within the same 128-byte chunk, aligned on a 128-byte boundary.

The answer to your Q2 is that if your arrays are aligned correctly, and your accesses are of the form a[32*x+i] with i the thread id and x any integer that is the same for all threads, your accesses will be coalesced.

According to Section 5.3.2.1.1 of the CUDA C Programming Guide, allocated memory is always aligned on at least 256-byte boundaries, so arrays created with cudaMalloc are always aligned correctly.
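Applying this to the layout in the question, the sketch below (plain host-side C++, no GPU required) walks over every warp of every thread block and reports the largest number of 128-byte chunks any single warp touches. It assumes each thread reads the element at `blockIdx.x * cols + threadIdx.x`; the question does not show the kernel, so that index formula, the function name `worstWarpSegments`, and the constants are all illustrative assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>

// Assumptions for this model (not queried from any device):
constexpr std::size_t kWarpSize = 32;      // threads per warp
constexpr std::size_t kSegmentBytes = 128; // size of one aligned chunk
constexpr std::size_t kFloatBytes = 4;     // size of one float element

// One thread block per row, one thread per column.
// Assumed access: thread t of block b reads a[b * cols + t].
// Returns the worst case over all warps: how many chunks one warp touches.
std::size_t worstWarpSegments(std::size_t rows, std::size_t cols,
                              std::size_t baseByteAddress) {
    std::size_t worst = 0;
    for (std::size_t block = 0; block < rows; ++block) {
        for (std::size_t warpStart = 0; warpStart < cols;
             warpStart += kWarpSize) {
            std::set<std::size_t> segments;
            for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
                const std::size_t thread = warpStart + lane;
                if (thread >= cols) break; // last warp may be partial
                const std::size_t element = block * cols + thread;
                segments.insert(
                    (baseByteAddress + element * kFloatBytes) / kSegmentBytes);
            }
            worst = std::max(worst, segments.size());
        }
    }
    return worst;
}
```

With 10 rows of 96 floats and a base address that is a multiple of 128 (which a 256-byte-aligned allocation satisfies), every row starts 384 bytes after the previous one, which is exactly three chunks, so `worstWarpSegments(10, 96, 0)` gives 1. If the rows were 100 floats wide instead, the second row would start at byte 400, which is not a multiple of 128, and some warps would straddle two chunks.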