I am multiplying a row of a matrix by the inverse of the principal diagonal element of that row. I have implemented this as 1-D parallel code. Every thread runs these steps:
1. Read the principal diagonal element.
2. Calculate the inverse of that element.
3. Multiply the element indexed by the thread id by that inverse.
The problem arises when the ith thread in the ith row executes step 3 before another thread has executed step 1. That thread overwrites the principal diagonal element before the others can read it, so they compute the inverse of the wrong value.
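To make the hazard concrete, here is a minimal sequential sketch in plain C, not the actual kernel. It only emulates one unlucky ordering of the threads; the names `N`, `normalize_racy` and `normalize_two_phase` are hypothetical and chosen just for this illustration.

```c
#define N 4  /* row length, an assumption made only for this illustration */

/* Emulates the unlucky order: the thread whose id equals the diagonal
   index runs all three steps before any other thread has done step 1. */
void normalize_racy(float row[N], int diag)
{
    row[diag] *= 1.0f / row[diag];       /* diagonal is overwritten with 1 */
    for (int tid = 0; tid < N; ++tid) {
        if (tid == diag)
            continue;
        float inv = 1.0f / row[diag];    /* steps 1-2: reads the overwritten value */
        row[tid] *= inv;                 /* step 3: scales by the wrong factor */
    }
}

/* Emulates the desired order: every thread finishes steps 1-2
   before any thread starts step 3. */
void normalize_two_phase(float row[N], int diag)
{
    float inv[N];                        /* one private copy per emulated thread */
    for (int tid = 0; tid < N; ++tid)
        inv[tid] = 1.0f / row[diag];     /* steps 1-2 for all threads */
    /* this is the point where the threads would have to synchronize */
    for (int tid = 0; tid < N; ++tid)
        row[tid] *= inv[tid];            /* step 3 for all threads */
}
```

With the row {2, 4, 6, 8} and `diag` equal to 0, the first function leaves {1, 4, 6, 8}, while the second gives the intended {1, 2, 3, 4}.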
Does OpenCL have a barrier-like mechanism that lets a thread execute step 3 only after all threads have executed step 1?
I don't want to use empty loops as a delay, because in the worst case they can still fail.