I have a performance problem when using LDS memory with AMD Radeon HD 6850.
I have two kernels as parts of a N-particle simulation. Each work unit has to calculate force which acts on a corresponding particle based on relative position to other particles. The problematic kernel is:
#define UNROLL_FACTOR 8
//Vernet velocity part kernel
__kernel void kernel_velocity(const float deltaTime,
__global const float4 *pos,
__global float4 *vel,
__global float4 *accel,
__local float4 *pblock,
const float bound)
{
const int gid = get_global_id(0); //global id of work item
const int id = get_local_id(0); //local id of work item within work group
const int s_wg = get_local_size(0); //work group size
const int n_wg = get_num_groups(0); //number of work groups
const float4 myPos = pos[gid];
const float4 myVel = vel[gid];
const float4 dt = (float4)(deltaTime, deltaTime, 0.0f, 0.0f);
float4 acc = (float4)0.0f;
for (int jw = 0; jw < n_wg; ++jw)
{
pblock[id] = pos[jw * s_wg + id]; //cache a particle position; position in array: workgroup no. * size of workgroup + local id
barrier (CLK_LOCAL_MEM_FENCE); //wait for others in the work group
for (int i = 0; i < s_wg; )
{
#pragma unroll UNROLL_FACTOR
for (int j = 0; j < UNROLL_FACTOR; ++j, ++i)
{
float4 r = myPos - pblock[i];
float rSizeSquareInv = native_recip (r.x*r.x + r.y*r.y + 0.0001f);
float rSizeSquareInvDouble = rSizeSquareInv * rSizeSquareInv;
float rSizeSquareInvQuadr = rSizeSquareInvDouble * rSizeSquareInvDouble;
float rSizeSquareInvHept = rSizeSquareInvQuadr * rSizeSquareInvDouble * rSizeSquareInv;
acc += r * (2.0f * rSizeSquareInvHept - rSizeSquareInvQuadr);
}
}
barrier(CLK_LOCAL_MEM_FENCE);
}
acc *= 24.0f / myPos.w;
//update velocity only
float4 newVel = myVel + 0.5f * dt * (accel[gid] + acc);
//write to global memory
vel[gid] = newVel;
accel[gid] = acc;
}
The simulation runs fine in terms of results, but the problem is in the performance when using the local memory for caching the particle positions to relieve the big amount of reading from the global memory. Actually if the line
float4 r = myPos - pblock[i];
is replaced by
float4 r = myPos - pos[jw * s_wg + i];
the kernel runs faster. I don't really get that since reading from global should be much slower than reading from local.
Moreover, when the line
float4 r = myPos - pblock[i];
is removed completely and all following occurences of r are replaced by myPos - pblock[i], the speed is the same as before as if the line was not there at all. This I don't get even more as accessing private memory in r should be the fastest but the compiler somehow "optimizes" this line out.
Global work size is 4608, local worksize is 192. It is run with AMD APP SDK v2.9 and Catalyst drivers 13.12 in Ubuntu 12.04.
Can anyone please help me with this? Is that my fault or is that a problem of the GPU / drivers / ... ? Or is it a feature? :-)