3
votes

Hi, I have a kernel function in which I need to compare bytes. The area I want to search is divided into blocks, so an array of 4k bytes is divided into 4k/256 = 16 blocks. Each thread in a block reads the array at idx and compares it with another array that holds what I want to search for. I've done this in two ways:

1. Compare the data in global memory, but threads in a block often need to read the same address.

2. Copy the data from global memory to shared memory and compare the bytes in shared memory in the same way as above. The problem of reading the same address remains. The copy to shared memory looks like this:

myArray[idx] = global[someIndex-idx];
whatToSearch[idx] = global[someIndex+idx];

The rest of the code is the same; only the operations on the data in option 2 are performed on the shared arrays.

But the first option is about 10% faster than the one with shared memory. Why? Thank you for any explanation.

2
Please post a complete example. Without it all of the current answers are pure speculation. Your comments on the answers below are not enough to make it clear what you are doing. - harrism

2 Answers

12
votes

If you are only using the data once and there is no data reuse between different threads in a block, then using shared memory will actually be slower. The reason is that copying data from global memory to shared memory still counts as a global transaction. Reads from shared memory are faster, but that doesn't help here: you already had to read the data once from global memory, so the subsequent read from shared memory is just an extra step that adds no value.

So, the key point is that using shared memory is only useful when you need to access the same data more than once (whether from the same thread, or from different threads in the same block).
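To illustrate the reuse case, here is a minimal sketch of a byte-pattern search in which shared memory can pay off. It is not the asker's code: the names `matchShared` and `findAll`, the constants `BLOCK` and `MAX_PATTERN`, and the one-window-per-thread layout are all assumptions made for this example. Each block stages the pattern and its slice of the text in shared memory once, and every staged byte is then read by up to `patLen` different threads.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical sizes for this sketch; the question only mentions 256-thread blocks.
constexpr int BLOCK = 256;       // threads per block
constexpr int MAX_PATTERN = 64;  // upper bound on the pattern length

// Each thread tests whether the pattern starts at its own text position.
// The block stages the pattern and its slice of the text (plus a halo of
// patLen - 1 bytes) in shared memory once, then reuses those bytes.
__global__ void matchShared(const unsigned char* text, int textLen,
                            const unsigned char* pattern, int patLen,
                            unsigned char* found)
{
    __shared__ unsigned char sPat[MAX_PATTERN];
    __shared__ unsigned char sText[BLOCK + MAX_PATTERN - 1];

    const int base = blockIdx.x * BLOCK;     // first text index of this block
    const int tileLen = BLOCK + patLen - 1;  // slice plus halo

    // Cooperative loads: consecutive threads read consecutive bytes.
    for (int i = threadIdx.x; i < patLen; i += BLOCK)
        sPat[i] = pattern[i];
    for (int i = threadIdx.x; i < tileLen; i += BLOCK)
        sText[i] = (base + i < textLen) ? text[base + i] : 0;

    __syncthreads();  // staged data must be complete before any thread reads it

    const int pos = base + threadIdx.x;
    if (pos + patLen > textLen) return;  // window would run past the end

    bool match = true;
    for (int k = 0; k < patLen && match; ++k)
        match = (sText[threadIdx.x + k] == sPat[k]);
    found[pos] = match ? 1 : 0;
}

// Host helper (hypothetical): returns every index where the pattern starts.
// Error checking of the CUDA calls is omitted to keep the sketch short.
std::vector<int> findAll(const std::vector<unsigned char>& text,
                         const std::vector<unsigned char>& pattern)
{
    std::vector<int> hits;
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    if (m == 0 || m > MAX_PATTERN || n < m) return hits;

    unsigned char *dText = nullptr, *dPat = nullptr, *dFound = nullptr;
    cudaMalloc(&dText, n);
    cudaMalloc(&dPat, m);
    cudaMalloc(&dFound, n);
    cudaMemcpy(dText, text.data(), n, cudaMemcpyHostToDevice);
    cudaMemcpy(dPat, pattern.data(), m, cudaMemcpyHostToDevice);
    cudaMemset(dFound, 0, n);  // positions no thread writes stay "no match"

    matchShared<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(dText, n, dPat, m, dFound);

    std::vector<unsigned char> found(n);
    cudaMemcpy(found.data(), dFound, n, cudaMemcpyDeviceToHost);
    cudaFree(dText);
    cudaFree(dPat);
    cudaFree(dFound);

    for (int i = 0; i < n; ++i)
        if (found[i]) hits.push_back(i);
    return hits;
}
```

If each thread instead compared only a single byte that no other thread touches, the staging loops and the `__syncthreads()` barrier would be pure overhead, which is the situation described above.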

1
votes

You are using shared memory to save on accesses to global memory, but each thread is still making two accesses to global memory, so it won't be faster. The speed drop is probably because the threads that access the same location in global memory within a block try to read it into the same location in shared memory, and this needs to be serialized.

I'm not sure exactly what you are doing from the code you posted, but you should ensure that the number of reads from and writes to global memory, aggregated across all the threads in a block, is significantly lower when you use shared memory. Otherwise you won't see a performance improvement.
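As a rough worked count of that aggregate, assume (hypothetically, since the question does not say) that a block has $B$ threads and each thread compares a window of $P$ bytes against a $P$-byte pattern. Counting requested bytes rather than actual memory transactions:

```latex
\begin{align*}
N_{\text{global only}} &= B \cdot (P + P) = 2BP \\
N_{\text{staged}}      &= (B + P - 1) + P = B + 2P - 1 \\[4pt]
\text{For } B = 256,\ P = 16:\quad
N_{\text{global only}} &= 8192, \qquad N_{\text{staged}} = 287
\end{align*}
```

This is only an upper bound on the benefit, because caches and coalescing already absorb many of the repeated global reads. If instead each thread reads one byte from each array exactly once, both versions request the same $2B$ bytes from global memory and staging saves nothing.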