I recently started getting into GPGPU and wrote an MD (molecular dynamics) simulation program as a class assignment. The part that calculates the force on each particle from all the other particles is shown below.
vec3 pn = poss[gid].xyz;
vec3 f = vec3(0, 0, 0);
for (uint i = 0; i < gid; i++) {
    f += df(pn, poss[i].xyz);
}
for (uint i = gid+1; i < num; i++) {
    f += df(pn, poss[i].xyz);
}
fors[gid].xyz = f;
Running this code with 32000 instances (500 threads) took 50 ms on my GTX 960.
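For context, here is a minimal sketch of how the snippet above might sit inside a complete compute shader. The buffer layout and bindings, the work-group size of 64, the bounds check, and especially the body of df are assumptions made for illustration only; the real df is not shown in this post, and the placeholder below is just a generic inverse-square pair force.

```glsl
#version 430

// Assumed work-group size; the real value is not stated in the post.
layout(local_size_x = 64) in;

// Assumed storage buffers: positions in, forces out (vec4 per particle).
layout(std430, binding = 0) readonly buffer Positions { vec4 poss[]; };
layout(std430, binding = 1) buffer Forces { vec4 fors[]; };

// Total number of particles (assumed to be passed as a uniform).
uniform uint num;

// Hypothetical placeholder for the pair force; NOT the actual df.
// The result is not finite when a == b, hence the need to skip i == gid.
vec3 df(vec3 a, vec3 b) {
    vec3 d = a - b;
    float r2 = dot(d, d);
    return d / (r2 * sqrt(r2));
}

void main() {
    uint gid = gl_GlobalInvocationID.x;
    if (gid >= num) return;  // guard in case the dispatch overshoots num

    vec3 pn = poss[gid].xyz;
    vec3 f = vec3(0.0);

    // Particles before this one.
    for (uint i = 0u; i < gid; i++) {
        f += df(pn, poss[i].xyz);
    }
    // Particles after this one.
    for (uint i = gid + 1u; i < num; i++) {
        f += df(pn, poss[i].xyz);
    }

    fors[gid].xyz = f;
}
```

With this layout, every invocation performs num - 1 calls to df in total, although the trip counts of the two individual loops differ from one invocation to the next.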
My instructor suggested merging the two loops, on the grounds that thread synchronisation(?) was causing the long execution time. So I changed it as follows.
for (uint i = 0; i < num; i++) {
    if (i != gid) f += df(pn, poss[i].xyz);
}
However, this took 65 ms to run (15 ms longer). So:
- Is it true that on modern hardware (GL 4.3+), local threads running for loops of variable length still all need to finish before any of them can continue, and
- If so, why is the second version slower?
Thank you very much.
Edit: df returns infinity when both arguments are the same particle, so simply removing the conditional expression is not an option.