I'm trying to optimize some code originally written in Fortran.
The algorithm involves operating on a large array (~27 million cells) over several iterations. Each cell can be evaluated independently during one iteration. However, the iterations cannot be parallelized because the calculation done at t+1 depends on the results of the calculation done at t.
A rough, simplified nonparallel pseudocode example:
for (t = 0; t < tmax; t++)
{
    A = A + B;
    B = B + A / 2;
}
where A and B are big arrays.
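For reference, here is a minimal serial C++ sketch of that pseudocode. The function name `step_reference` and the use of `std::vector<double>` are only assumptions for illustration; the real code works on much larger arrays and a more involved update. As in the kernels further down, B is updated from the already-updated A.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference for the pseudocode (hypothetical helper, not the real code).
// Within each iteration A is updated first, then B is updated from the new A.
void step_reference(std::vector<double>& A, std::vector<double>& B, int tmax)
{
    assert(A.size() == B.size());
    for (int t = 0; t < tmax; ++t) {
        // A = A + B, element by element
        for (std::size_t i = 0; i < A.size(); ++i)
            A[i] = A[i] + B[i];
        // B = B + A / 2, using the A values just computed
        for (std::size_t i = 0; i < B.size(); ++i)
            B[i] = B[i] + A[i] / 2;
    }
}
```

Each cell only ever touches its own index, which is what makes a single iteration parallel across cells.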
Currently, I've implemented this by calling enqueueNDRangeKernel inside a loop in the host C++ code. Because each iteration needs the results of the previous one, I write the results to global memory every time.
Doing 27 million global memory writes per iteration kills my performance. I have two kernel versions I'm playing around with; compared to Fortran, version 1 is ~2.5x faster; version 2 is 4x faster.
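As a rough back-of-the-envelope estimate of the traffic involved, the sketch below counts the bytes moved through global memory per iteration for the two simplified kernels listed under "Kernels". It assumes exactly 27 million cells, 8-byte doubles, no caching, and that each kernel reads A and B once and writes one of them once; the helper name `bytes_per_iteration` is hypothetical.

```cpp
#include <cstdint>

// Estimated global memory traffic for one iteration (kernA followed by kernB).
// Assumption: every read and write goes to global memory, nothing is cached.
std::uint64_t bytes_per_iteration(std::uint64_t cells)
{
    const std::uint64_t bytes_per_value = sizeof(double); // 8 on typical platforms
    const std::uint64_t accesses_per_kernel = 3;          // read A, read B, write A or B
    const std::uint64_t kernels_per_iteration = 2;        // kernA and kernB
    return cells * bytes_per_value * accesses_per_kernel * kernels_per_iteration;
}
```

Under those assumptions, 27,000,000 cells come to 1,296,000,000 bytes (about 1.3 GB) per iteration, of which one third is writes.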
I've tried fiddling around with the algorithm, as well as doing stuff with pointers (version 2).
My question is as follows: is there any way to avoid this global memory write chokepoint?
Thanks!
Requested code:
Call in C++:
NDRange global(nxp1 * ny * nz);
NDRange local(nz);
for (w = 0; w < 100; w++)
{
    queue.enqueueNDRangeKernel(kernA, NullRange, global, local);
    queue.enqueueBarrierWithWaitList();
    queue.enqueueNDRangeKernel(kernB, NullRange, global, local);
}
queue.finish();
Kernels:
__kernel void kernA(__global double *A, __global double *B)
{
    int i = get_global_id(0);
    double A_l;
    A_l = A[i];
    double B_l;
    B_l = B[i];
    A_l = A_l + B_l;
    A[i] = A_l; // if this line is removed, everything goes much faster.
}

__kernel void kernB(__global double *A, __global double *B)
{
    int i = get_global_id(0);
    double A_l;
    A_l = A[i];
    double B_l;
    B_l = B[i];
    B_l = B_l + A_l / 2;
    B[i] = B_l; // if this line is removed, everything goes much faster.
}
I've simplified the kernel code for the sake of clarifying the algorithm. But the idea is that I update A based on B; then I update B based on A. This is repeated for several iterations.
To compute the (i, j) entry of the new A or B, do you need to reference anything other than the (i, j) entries of the old A and B? If not, then you don't need to use global memory for anything other than the input to the first iteration and the output from the last iteration, and you can compute many iterations within a single kernel invocation. - user57368
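A minimal sketch of what this comment describes, written as plain C++ so it can be checked on the host. It assumes that each cell depends only on its own A and B values, which holds for the simplified kernels above but may not hold for the real code. The name `fused_iterations` is hypothetical; the outer loop over `i` stands in for one work-item per cell.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Run all iterations for each cell while its values stay in local variables
// (registers in a kernel), touching the arrays only at the start and the end.
void fused_iterations(std::vector<double>& A, std::vector<double>& B, int iters)
{
    assert(A.size() == B.size());
    for (std::size_t i = 0; i < A.size(); ++i) { // stands in for one work-item
        double a = A[i]; // single read of A for this cell
        double b = B[i]; // single read of B for this cell
        for (int w = 0; w < iters; ++w) {
            a = a + b;     // same update as kernA
            b = b + a / 2; // same update as kernB, using the new a
        }
        A[i] = a; // single write of A for this cell
        B[i] = b; // single write of B for this cell
    }
}
```

In an OpenCL kernel, the outer loop would presumably be replaced by `int i = get_global_id(0);` and the iteration count passed as a kernel argument, so the host enqueues one kernel instead of two per iteration. If the real update reads neighbouring cells, this fusion is not valid as written.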