Question
When moving many random (non-coalesced) values within a device global memory, which is the most efficient way to do it?
Note: Many values as in > 500.
Context
I've been working on a Genetic Algorithm (GA) implementation for GPUs for some time now, and I've been struggling to balance the flexibility of my framework against micro-optimizing for the GPU architecture. The GA data resides on the GPU at all times; only the best solutions of each generation are copied to host memory.
Detailed scenario
I'm optimizing the migration function. Here, relatively little data is shuffled within device global memory, but I've laid out my data so that it's coalesced for the GA operator kernels' memory access pattern. That makes shuffling a pair of "genomes" a matter of striding through single float values and swapping them with another genome's values in the same strided fashion.
Known solutions
The problem isn't memory bandwidth, but call latency and thread blocks stalling the process.
I wrote a couple of device kernels whose only job is to move values between addresses. These launch with VERY low occupancy, divergent code, and random memory access (so the little code they run would be mostly serialized), but they do the work in only two kernel calls:
- 1st kernel: copies the values to a buffer array.
- 2nd kernel: swaps the values.
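For concreteness, a minimal sketch of what those two kernels might look like under my layout (one thread per gene; `population`, `idx_a`, `idx_b`, and the kernel names are illustrative, not my actual framework code):

```cuda
// Sketch only. Assumes a gene-major layout where gene g of genome i
// lives at population[i + g * stride_size].

// 1st kernel: stage genome A's genes into a buffer.
__global__ void copy_to_buffer(float *buffer, const float *population,
                               int idx_a, int stride_size, int num_genes)
{
    int gene = blockIdx.x * blockDim.x + threadIdx.x;
    if (gene < num_genes)
        buffer[gene] = population[idx_a + gene * stride_size];
}

// 2nd kernel: overwrite genome A with genome B, then restore genome B
// from the buffered copy of A.
__global__ void swap_from_buffer(float *population, const float *buffer,
                                 int idx_a, int idx_b,
                                 int stride_size, int num_genes)
{
    int gene = blockIdx.x * blockDim.x + threadIdx.x;
    if (gene < num_genes) {
        population[idx_a + gene * stride_size] =
            population[idx_b + gene * stride_size];
        population[idx_b + gene * stride_size] = buffer[gene];
    }
}
```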
I'm aware I could use cudaMemcpy for every value, but that would require many calls to cudaMemcpy, which I believe are synchronous.
Simplified code example:
int needed_genome_idx = 0; // Some random index (fixed here for illustration).
for (int nth_gene = 0; nth_gene < num_genes; ++nth_gene)
{
    cudaMemcpy(genomes_buffer + nth_gene,
               src + needed_genome_idx + nth_gene * stride_size, // stride_size is a big number, usually the GA population size.
               sizeof(float),
               cudaMemcpyDeviceToDevice);
}
Is this a viable solution? Would using cudaMemCpyAsync help performance?
Is there a better way, or at least more elegant way, to do such memory operations?
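For reference, this is roughly what the async variant of the loop above would look like, assuming the per-call overhead (not bandwidth) is what dominates; the stream setup is illustrative:

```cuda
// Sketch: enqueue the per-gene 4-byte copies on a stream so the host
// thread is not blocked waiting for each tiny transfer to complete.
cudaStream_t stream;
cudaStreamCreate(&stream);

for (int nth_gene = 0; nth_gene < num_genes; ++nth_gene)
{
    cudaMemcpyAsync(genomes_buffer + nth_gene,
                    src + needed_genome_idx + nth_gene * stride_size,
                    sizeof(float),
                    cudaMemcpyDeviceToDevice,
                    stream);
}

// Make sure all copies have finished before the buffer is read or reused.
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);
```

Even so, each call still pays its own launch latency, so this only hides the host-side blocking, not the per-call cost.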
Comments:
- cudaMemcpy() call is obviously a VERY inefficient way. – kangshiyin
- cudaMemcpy()/cudaMemcpyAsync() will reach its max speed when the size of the data is > 100 KBytes, as shown in the D2D copy speed here: stackoverflow.com/questions/17729351/… – kangshiyin
- cudaMemcpyDeviceToDevice can be slower than writing a kernel to do the same operation; see "CUDA Device To Device transfer expensive". Your code isn't very clear to me: genomes_buffer and src are undefined, and needed_genome_idx is claimed to be random, but set to 0. Could you be a bit more clear on what "migration" actually means for you? – Vitality