Question
When moving many random (non-coalesced) values within a device global memory, which is the most efficient way to do it?
Note: Many values as in > 500.
Context
I've been working on a Genetic Algorithm (GA) implementation for GPUs for some time now, and I've been struggling to balance the flexibility of my framework against micro-optimizing for the GPU architecture. The GA data resides on the GPU at all times; only the best solutions of each generation are copied to host memory.
Detailed scenario
I'm optimizing the migration function. Here, relatively little data is shuffled within device global memory, but I've laid out my data so that it's coalesced for the GA operator kernels' memory access pattern. That makes shuffling a pair of "genomes" a matter of striding through single float values and swapping them with another genome's values in the same strided fashion.
Known solutions
The problem isn't memory bandwidth, but call latency and thread blocks stalling the process.
I wrote a couple of device kernels whose only job is to move values between addresses. These launch with VERY low occupancy, divergent code, and random memory access (so the little code they run would be mostly serialized), but they do the work in only two kernel calls:
- 1st kernel: copies the values to a buffer array.
- 2nd kernel: swaps the values.
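For concreteness, a minimal sketch of what those two kernels might look like under my layout (one thread per gene; `population`, `idx_a`, `idx_b`, and the kernel names are illustrative, not my actual framework code):

```cuda
// Sketch only. Assumes a gene-major layout where gene g of genome i
// lives at population[i + g * stride_size].

// 1st kernel: stage genome A's genes into a buffer.
__global__ void copy_to_buffer(float *buffer, const float *population,
                               int idx_a, int stride_size, int num_genes)
{
    int gene = blockIdx.x * blockDim.x + threadIdx.x;
    if (gene < num_genes)
        buffer[gene] = population[idx_a + gene * stride_size];
}

// 2nd kernel: overwrite genome A with genome B, then restore genome B
// from the buffered copy of A.
__global__ void swap_from_buffer(float *population, const float *buffer,
                                 int idx_a, int idx_b,
                                 int stride_size, int num_genes)
{
    int gene = blockIdx.x * blockDim.x + threadIdx.x;
    if (gene < num_genes) {
        population[idx_a + gene * stride_size] =
            population[idx_b + gene * stride_size];
        population[idx_b + gene * stride_size] = buffer[gene];
    }
}
```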
I'm aware I could use cudaMemcpy for every value, but that would require many calls to cudaMemcpy, which I believe are synchronous.
Simplified code example:
int needed_genome_idx = 0; // Some random index (fixed here for illustration).
for (int nth_gene = 0; nth_gene < num_genes; ++nth_gene)
{
    cudaMemcpy(genomes_buffer + nth_gene,
               src + needed_genome_idx + nth_gene * stride_size, // stride_size is a big number, usually the GA population size.
               sizeof(float),
               cudaMemcpyDeviceToDevice);
}
Is this a viable solution? Would using cudaMemCpyAsync help performance?
Is there a better way, or at least more elegant way, to do such memory operations?
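For reference, this is roughly what the async variant of the loop above would look like, assuming the per-call overhead (not bandwidth) is what dominates; the stream setup is illustrative:

```cuda
// Sketch: enqueue the per-gene 4-byte copies on a stream so the host
// thread is not blocked waiting for each tiny transfer to complete.
cudaStream_t stream;
cudaStreamCreate(&stream);

for (int nth_gene = 0; nth_gene < num_genes; ++nth_gene)
{
    cudaMemcpyAsync(genomes_buffer + nth_gene,
                    src + needed_genome_idx + nth_gene * stride_size,
                    sizeof(float),
                    cudaMemcpyDeviceToDevice,
                    stream);
}

// Make sure all copies have finished before the buffer is read or reused.
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);
```

Even so, each call still pays its own launch latency, so this only hides the host-side blocking, not the per-call cost.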
Comments:
- cudaMemcpy() call is obviously a VERY inefficient way. – kangshiyin
- cudaMemcpy()/cudaMemcpyAsync() will reach its max speed when the size of the data is > 100 KBytes, as shown in the D2D copy speed here: stackoverflow.com/questions/17729351/… – kangshiyin
- cudaMemcpyDeviceToDevice can be slower than writing a kernel to do the same operation; see "CUDA Device To Device transfer expensive". Your code isn't very clear to me: genomes_buffer and src are undefined, and needed_genome_idx is claimed to be random, but set to 0. Could you be a bit more clear on what "migration" actually means for you? – Vitality