0
votes

I'm trying to optimize some code originally written in Fortran.

The algorithm involves operating on a large array (~27 million cells) over several iterations. Each cell can be evaluated independently during one iteration. However, the iterations cannot be parallelized because the calculation done at t+1 depends on the results of the calculation done at t.

A rough, simplified nonparallel pseudocode example:

for (t=0; t<tmax; t++)
{
A = A + B;

B  = B + A /2;
}

where A and B are big arrays.

Currently, I've implemented this by calling EnqueueNDRangeKernel out of a loop in the host C++ code. Because I need the results of the previous iteration, I am writing into global memory each time.

Doing 27 million global memory writes per iteration kills my performance. I have two kernel versions I'm playing around with; compared to Fortran, version 1 is ~2.5x faster; version 2 is 4x faster.

I've tried fiddling around with the algorithm, as well as doing stuff with pointers (version 2).

My question is as follows: is there any way to avoid this global memory write chokepoint?

Thanks!


Requested code:

Call in C++:

NDRange global(nxp1*ny*nz);
NDRange local(nz);

    for (w=0; w<100; w++)
    {
       queue.enqueueNDRangeKernel(kernA, NullRange, global, local);
        queue.enqueueBarrierWithWaitList();
       queue.enqueueNDRangeKernel(kernB, NullRange, global, local);
    }

queue.finish();

Kernels:

__kernel void kernA(__global double *A, __global double *B)
    {
    int i = get_global_id(0);

    double A_l;
    A_l = A[i];
    double B_l;
    B_l = B[i];

    A_l = A_l + B_l;

    A[i] = A_l; //if this line is removed, everything goes much faster.

    }

   __kernel void kernB(__global double *A, __global double *B)
    {
    int i = get_global_id(0);

    double A_l;
    A_l = A[i];
    double B_l;
    B_l = B[i];

    B_l = B_l + A_l/2;

    B[i] = B_l; //if this line is removed, everything goes much faster.
    }

I've simplified the kernel code for the sake of clarifying the algorithm. But the idea is that I update A based on B; then I update B based on A. This is repeated for several iterations.

2
In order to compute the (i, j) entry of the new A or B, do you need to reference anything other than the (i, j) entries of the old A and B? If not, then you don't need to use global memory for anything other than the input to the first iteration and the output from the last iteration and you can compute many iterations within a single kernel invocation. - user57368
I just need to know the old A(i) and B(i) from the previous iteration. Are you suggesting to run the iteration loop within the kernel? Would that preserve order of the iterations? - Vlad
what does a cell consist of? a single value? int,float, double, or something else? you can probably compute many more than 1 iteration in the same kernel call. - mfa
Can you post your kernel and your C/C++ code? Are you using local memory in any way? Vectors? - Austin
A, B are doubles, containing ~27 M cells. Will post relevant parts of code in a moment. - Vlad

2 Answers

0
votes

There is no way to entirely avoid the global write problem. You are writing the values once, and your speed is hardware-bound. You can cut down the number of global reads though, as long as you don't mind computing multiple steps at once. This still saves each step along the way.

__kernel void myKernel(__global double *A, __global double *B, __global uint outDataMultiple)              
{                                                                                      
    const uint gid = get_global_id(0);
    const uint inDataSize = get_global_size(0);

    double2 nextValue;
    nextValue.x = A[gid];
    nextValue.y = B[gid];
    for(uint i=0; i<outDataMultiple; i++){
        nextValue.x = nextValue.x + nextValue.y;
        nextValue.y = nextValue.y + nextValue.x /2;
        A[gid+i+1] = nextValue.x;
        B[gid+i+1] = nextValue.y;
    }
}

With the kernel above, a work item will take care of multiple iterations for a single cell. You need to allocate outDataMultiple times more memory, and the kernel will fill out the rest. The global work item count determines the size of the initial input. outDataMultiple is limited only by global memory allocation and possibly the complexity of the math you are doing with each iteration.

Total global memory required: 27M * sizeof(double2) * (1+outDataMultiple)

__kernel void myKernel(__global double2 *data, __global uint outDataMultiple)              
{                                                                                      
    const uint gid = get_global_id(0);
    const uint inDataSize = get_global_size(0);

    double2 nextValue = data[gid];
    for(uint i=0; i<outDataMultiple; i++){
        nextValue.x = nextValue.x + nextValue.y;
        nextValue.y = nextValue.y + nextValue.x /2;
        data[gid+i+1] = nextValue;
    }
}

The double2 version of the same kernel may be possible as long as you can interlace the A and B vectors. This will combine reads and writes to guarantee 8-byte blocks, and probably improve performance a bit more.

0
votes

An easy way to reduce the time the OpenCL device spends fetching from global memory is to buffer the global memory to local memory in a batch, operate on the local memory, and then write the local memory to global memory in a batch.

Local memory has essentially the same latency as thread memory and can be read from global memory in chunks. Local memory can be declared on the host and passed to the kernel (see example below) or allocated in the kernel and used (see example in the AMD optimization guide listed below). For example:

__kernel void kernA(__global double *A, 
                    __global double *B, 
                    __local double *BufferA,
                    __local double *BufferB)
    {
    BufferA[get_local_id(0)] = A[get_global_id(0)];
    BufferB[get_local_id(0)] = B[get_global_id(0)];
    mem_fence(CLK_LOCAL_MEM_FENCE);

    double tmp = BufferA[get_local_id(0)] + BufferB[get_local_id(0)];

    A[get_global_id(0)] = BufferA[get_local_id(0)];
    mem_fence(CLK_GLOBAL_MEM_FENCE);
    }

There are further things that can be done, including: