At a high level, GPUs are data-parallel computing devices: they are built to run the same task on different data, and they do not perform well when their tasks diverge and do different things.
Your code is illustrative of a task-parallel problem, so my high-level question is: what type of problem are you solving? If it is a task-parallel problem, then perhaps a GPU isn't the best solution. Would a multi-core CPU be an alternative?
Your code is typical of a 'spinlock', where the code loops until a value changes. It's often used for short-term, lightweight locking in databases. This is dangerous code even on a CPU, as a mistake or error can lock up the CPU or GPU. For CPU code, a spinlock is usually protected with an interrupt timer.
The usage is:
1) set a timer
2) spin until a value changes
3) continue or time-out
So after the requisite number of milliseconds the code is interrupted and an error is thrown. If you use the spinlock pattern, then for safety add a loop exit to the while statement after a suitable number of iterations, as in the sketch below.
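As a minimal sketch (the volatile global int flag being watched and the loop budget are illustrative assumptions, not taken from your code), a bounded spin-wait could look like this:

__kernel
void spin_wait( volatile global int * flag, global int * status )
{
    const int MAX_SPINS = 1000000;          // arbitrary safety budget; tune for your workload
    int spins = 0;

    // spin until the flag changes or the loop budget runs out
    while ( atomic_add( flag, 0 ) == 0 && spins < MAX_SPINS )  // atomic_add(p, 0) forces a coherent read
        spins++;

    if ( get_global_id(0) == 0 )            // zero thread reports the outcome
        status[0] = ( spins < MAX_SPINS ) ? 0 : -1;  // 0 = flag seen, -1 = gave up (timed out)
}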
In OpenCL reduction algorithms, it's typical for the zero thread (get_global_id(0) == 0)
to return the final singleton result. Prior to this, all threads would have been synchronized using a barrier call (note that barrier() only synchronizes work-items within the same work-group):
__kernel
void mytask( ... , global float * result )
{
    int thread = get_global_id(0);

    ... your code

    // flush local and global memory so every work-item sees the latest values
    // (or enqueue a memory fence; see the OpenCL spec for details)
    barrier( CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE );

    if ( thread == 0 )          // zero thread
        result[0] = value;      // store the singleton result as the zeroth array element
}
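On the host side, a hedged sketch of how that singleton might be read back (the function name, queue and result_buf are illustrative assumptions, not part of your code):

#include <stdio.h>
#include <CL/cl.h>

/* Read back the single float written by the zero thread.
   "queue" and "result_buf" are assumed to have been created earlier. */
float read_singleton_result( cl_command_queue queue, cl_mem result_buf )
{
    float value = 0.0f;
    /* blocking read of the zeroth element of the result buffer */
    cl_int err = clEnqueueReadBuffer( queue, result_buf, CL_TRUE,
                                      0, sizeof(float), &value, 0, NULL, NULL );
    if ( err != CL_SUCCESS )
        fprintf( stderr, "clEnqueueReadBuffer failed: %d\n", err );
    return value;
}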