
I have locally sorted queues in different CUDA thread blocks. Let's say there are m blocks. Now I have two problems:

1) I need to select only the k blocks out of the m blocks whose queue heads are the k smallest among the m head elements.

2) In one block I need to load the queues of the other blocks into shared memory. Can this be done?

Can anyone please tell me how to do these two operations?


1 Answer

  1. If you want to communicate (i.e. exchange data) between threadblocks, the only method is to use global memory.

    At a minimum, you would need some sort of selection process that can access the heads of each queue. This pretty much implies that you are going to have to place the head of each queue in global memory. Since you don't indicate where your "locally sorted" data resides, this may mean copying at least that much data (for example, if the queues are locally sorted and reside in shared memory).

  2. If a single block needs to load all queues, then all queues will need to be placed in global memory by their respective blocks.
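Both points can be sketched together: each block writes its sorted queue (and in particular its head) to global memory, and a single block can then gather all queues from global memory into its own shared memory. A minimal sketch, assuming int queues of a fixed length `queue_len` stored contiguously per block (the names `d_queues`, `d_heads`, `publish_heads`, and `gather_queues` are all hypothetical, not from any CUDA API):

```cuda
// Hypothetical layout: block b's sorted queue occupies
// d_queues[b * queue_len .. b * queue_len + queue_len - 1].

// (1) Each block publishes the head (smallest element) of its queue.
__global__ void publish_heads(const int *d_queues, int *d_heads, int queue_len)
{
    if (threadIdx.x == 0)
        d_heads[blockIdx.x] = d_queues[blockIdx.x * queue_len];
}

// (2) One block gathers all m queues from global into shared memory.
// Launch with m * queue_len * sizeof(int) bytes of dynamic shared memory;
// this only works while all queues fit in a single block's shared memory.
__global__ void gather_queues(const int *d_queues, int m, int queue_len)
{
    extern __shared__ int s_queues[];
    for (int i = threadIdx.x; i < m * queue_len; i += blockDim.x)
        s_queues[i] = d_queues[i];
    __syncthreads();
    // ... select the k smallest heads and process the chosen queues ...
}
```

The k-smallest selection over the m heads can then be done on the host, or inside the gathering block itself, since m is typically small.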

Both of your questions imply some sort of global synchronization: you want all the queues sorted before you collect them. In CUDA there is no defined global synchronization mechanism other than the kernel launch (newer CUDA versions offer cooperative-groups grid synchronization, but it requires a cooperative launch and hardware support). However, based on what you've described here, your algorithm might be amenable to an approach similar to the one outlined in the threadfence reduction sample: each threadblock does the work it needs to (e.g. sorting its queue), and then a single threadblock performs the clean-up tasks, such as collecting the queues and processing them. I'm not sure whether this will fit your overall processing. If not, my suggestion would be to break your work into separate kernels and use the kernel launch(es) as sync points.
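The threadfence-reduction pattern mentioned above looks roughly like this (a sketch under the assumptions of this question, not the actual sample code): each block does its work, publishes its results with a memory fence, and an atomic counter lets the last block to finish perform the clean-up. The names `blocks_done` and `sort_then_collect` are hypothetical.

```cuda
__device__ unsigned int blocks_done = 0;  // global completion counter

__global__ void sort_then_collect(/* ... queue buffers in global memory ... */)
{
    // Phase 1: this block sorts its own queue and writes it to
    // global memory (omitted).

    __threadfence();  // make this block's writes visible to all other blocks

    __shared__ bool am_last;
    if (threadIdx.x == 0) {
        // atomicInc wraps the counter at gridDim.x - 1; the block that
        // reads gridDim.x - 1 is the last one to arrive.
        unsigned int ticket = atomicInc(&blocks_done, gridDim.x - 1);
        am_last = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    if (am_last) {
        // Phase 2: only this block runs here; it can now safely read
        // every queue from global memory and do the selection/processing.
    }
}
```

The combination of `__threadfence()` before the atomic and the wrap-around `atomicInc` guarantees the last block observes every other block's completed writes before it starts collecting.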