
I recently started venturing into GPGPU and wrote an MD (molecular dynamics) simulation program as a class assignment. The part that calculates the force on a particle from all the other particles is below.

vec3 pn = poss[gid].xyz;    // this invocation's particle position
vec3 f = vec3(0.0);
// accumulate force from every particle except the one at index gid
for (uint i = 0; i < gid; i++) {
    f += df(pn, poss[i].xyz);
}
for (uint i = gid + 1; i < num; i++) {
    f += df(pn, poss[i].xyz);
}
fors[gid].xyz = f;          // write the total force back out

Running this code with 32000 instances (500 threads) took 50 ms on my GTX 960.

My instructor suggested merging the two loops, reasoning that thread synchronisation(?) was causing the long execution time. So I changed the code to the following.

// single merged loop; skip the particle's own index with a branch
for (uint i = 0; i < num; i++) {
    if (i != gid) f += df(pn, poss[i].xyz);
}

However, this version took 65 ms (15 ms longer) to run. So:

  1. Is it true that on modern hardware (GL 4.3+), for loops of variable length in local threads still all need to finish before execution continues, and
  2. If so, why is the second version slower?

Thank you very much.

Edit: df will return infinity for the same particle, so removing the conditional expression is not an option.


1 Answer


1) I am still new to compute shaders myself, but I think the answer is yes. If different invocations within a single workgroup execute for loops of variable length, then some threads will finish earlier and have to wait until all the threads in that particular workgroup are done. After that, the workgroup may be swapped out for another unprocessed workgroup.
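For reference, the "(500 threads)" in the question suggests a workgroup declaration along these lines (an assumption on my part; the question does not show it):

```glsl
// Hypothetical layout matching "(500 threads)" in the question:
// 500 invocations per workgroup, so 64 workgroups cover 32000 particles.
layout(local_size_x = 500, local_size_y = 1, local_size_z = 1) in;

void main() {
    uint gid = gl_GlobalInvocationID.x;  // presumably how gid is obtained
    // ... force computation from the question goes here ...
}
```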

2) Because the two pieces of code aren't equivalent. By equivalent I mean the same statements: the first version has two loops, while the second has one big loop with an "if" statement inside it. So my guess is that the "if" statement is causing the higher running time.

If you can remove the "if" statement, for example by storing a vector at index gid of your poss array such that the value added to f for that index works out to vec3(0, 0, 0), I think it will improve your execution time.
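A sketch of that idea in GLSL (assumptions: df falls off with distance, and FAR_AWAY is a made-up position distant enough that its contribution is negligible). Note that a simple multiply-by-zero mask such as `f += df(pn, poss[i].xyz) * float(i != gid);` would not work here, because df returns infinity at i == gid and infinity times zero is NaN:

```glsl
// Sketch only: substitute a hypothetical far-away position for the
// particle's own slot, so every invocation runs the same loop body
// with no branch around the df() call.
const vec3 FAR_AWAY = vec3(1.0e18, 0.0, 0.0);

vec3 pn = poss[gid].xyz;
vec3 f = vec3(0.0);
for (uint i = 0; i < num; i++) {
    // a value select on the operand, rather than a branch over the call
    vec3 p = (i == gid) ? FAR_AWAY : poss[i].xyz;
    f += df(pn, p);
}
fors[gid].xyz = f;
```

Whether this actually beats the "if" version depends on the compiler: a small branch like the one in the question may already be compiled to a predicated select, which could be part of why merging the loops didn't help.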