Hi, I have a kernel function in which I need to compare bytes. The area I want to search is divided into blocks, so an array of 4k bytes is divided into 4k/256 = 16 blocks. Each thread in a block reads the array at its index idx and compares it with another array that holds the data I am searching for. I have implemented this in two ways:
1. Compare the data directly in global memory, although threads in a block often need to read the same address.
2. Copy the data from global memory to shared memory and compare the bytes in shared memory in the same way as described above. The problem of several threads reading the same address remains. The copy to shared memory looks like this:
myArray[idx] = global[someIndex-idx];
whatToSearch[idx] = global[someIndex+idx];
The rest of the code is the same. The only difference is that in variant 2 the operations on the data are performed on the shared arrays.
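To make the two variants concrete, here is a minimal sketch of how they might look. It is not the original kernel: the indexing with someIndex is not reproduced, and the names searchGlobal, searchShared, countMatches, BLOCK_SIZE and MAX_PATTERN are all hypothetical. It assumes each thread checks whether the search pattern starts at its own byte position, and that the pattern is at most MAX_PATTERN bytes long.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256   // assumed: 4096 bytes / 16 blocks = 256 threads per block
#define MAX_PATTERN 64   // assumed upper bound on the pattern length

// Variant 1: every thread compares directly in global memory.
__global__ void searchGlobal(const unsigned char *data, int n,
                             const unsigned char *pattern, int patternLen,
                             int *matchCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + patternLen > n) return;               // pattern would run past the end
    for (int j = 0; j < patternLen; ++j)
        if (data[i + j] != pattern[j]) return;    // mismatch, stop early
    atomicAdd(matchCount, 1);
}

// Variant 2: stage the block's tile and the pattern in shared memory first.
__global__ void searchShared(const unsigned char *data, int n,
                             const unsigned char *pattern, int patternLen,
                             int *matchCount)
{
    __shared__ unsigned char tile[BLOCK_SIZE + MAX_PATTERN - 1];
    __shared__ unsigned char pat[MAX_PATTERN];

    int blockStart = blockIdx.x * blockDim.x;
    int tileLen = blockDim.x + patternLen - 1;    // tile plus overlap into the next block
    for (int k = threadIdx.x; k < tileLen; k += blockDim.x) {
        int g = blockStart + k;
        tile[k] = (g < n) ? data[g] : 0;          // pad past the end of the array
    }
    for (int k = threadIdx.x; k < patternLen; k += blockDim.x)
        pat[k] = pattern[k];
    __syncthreads();                              // all copies must finish before comparing

    int i = blockStart + threadIdx.x;
    if (i + patternLen > n) return;
    for (int j = 0; j < patternLen; ++j)
        if (tile[threadIdx.x + j] != pat[j]) return;
    atomicAdd(matchCount, 1);
}

// Hypothetical host helper: runs one variant and returns the number of matches.
// Error checking of the CUDA calls is omitted to keep the sketch short.
int countMatches(const unsigned char *hData, int n,
                 const unsigned char *hPattern, int patternLen, bool useShared)
{
    if (patternLen < 1 || patternLen > MAX_PATTERN || n < patternLen) return -1;

    unsigned char *dData = nullptr, *dPattern = nullptr;
    int *dCount = nullptr;
    int hCount = 0;
    cudaMalloc(&dData, n);
    cudaMalloc(&dPattern, patternLen);
    cudaMalloc(&dCount, sizeof(int));
    cudaMemcpy(dData, hData, n, cudaMemcpyHostToDevice);
    cudaMemcpy(dPattern, hPattern, patternLen, cudaMemcpyHostToDevice);
    cudaMemset(dCount, 0, sizeof(int));

    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    if (useShared)
        searchShared<<<blocks, BLOCK_SIZE>>>(dData, n, dPattern, patternLen, dCount);
    else
        searchGlobal<<<blocks, BLOCK_SIZE>>>(dData, n, dPattern, patternLen, dCount);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&hCount, dCount, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dPattern);
    cudaFree(dCount);
    return hCount;
}
```

Calling countMatches with useShared set to false and then to true on the same buffer should give the same count. To compare run times, the two kernel launches could be timed separately, for example with cudaEventRecord and cudaEventElapsedTime.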
However, the first variant is about 10% faster than the one using shared memory. Why? Thank you for any explanation.