OpenCL data Synchronization

Question

I am multiplying a row of a matrix by the inverse of the principal diagonal element of that row. I have implemented it as 1-D parallel code. Every thread runs these steps:

1. Read the principal diagonal element.
2. Calculate the inverse of that element.
3. Multiply the element indexed by the thread ID by that inverse.

The problem arises when the i-th thread in the i-th row executes step 3 before another thread has executed step 1: it changes the value of the principal diagonal element before the others can read it.

Does OpenCL have something like a barrier that only allows a thread to execute step 3 after all threads have executed step 1?

I don't want to use empty loops because there can be worst cases in which they fail.

There are only work-group-wide barriers. If your code uses only one work-group then it is possible, but that is strongly discouraged if you want to run your code on a GPU. — Jovasa
I am using a GPU, and all these threads are spawned by a single call to enqueueNDRangeKernel, so I guess they are in the same global work-group. — Deep Joshi
No, the call spawns K work-items (the global size), which form work-groups of N work-items each (the local size). — Jovasa
How large is your matrix? Why don't you output to another matrix? — BlueWanderer
I am performing a row-reduction operation on a really large matrix. Currently, for testing, it can be smaller than 10x10, but it is supposed to work with sizes larger than 1000x1000. — Deep Joshi

EdwinDebuger · Accepted Answer · 2017-04-14T14:49:46

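One way is to add a barrier: barrier(CLK_LOCAL_MEM_FENCE) or, for data held in global memory, barrier(CLK_GLOBAL_MEM_FENCE). Note that a barrier only synchronizes the work-items within one work-group.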
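A minimal sketch of the barrier approach, not a definitive implementation. It assumes the matrix `a` is a row-major n x n float array in global memory and that the whole row is handled by a single work-group (global size equal to local size), since a barrier does not synchronize across work-groups. The kernel name `scale_row`, its arguments, and the serial helper `scale_row_reference` are hypothetical names chosen for illustration. The sketch uses the OpenCL C flag CLK_GLOBAL_MEM_FENCE because the row lives in global memory.

```c
#include <assert.h>

/* OpenCL C source, kept as a string the way it would be handed to
 * clCreateProgramWithSource. Kernel and argument names are illustrative. */
const char *scale_row_barrier_src =
    "__kernel void scale_row(__global float *a, const int n, const int row)\n"
    "{\n"
    "    const int col = get_global_id(0);\n"
    "    /* Step 1: every work-item reads the pivot into a private variable. */\n"
    "    const float pivot = a[row * n + row];\n"
    "    /* No work-item in the group continues until all have arrived. */\n"
    "    barrier(CLK_GLOBAL_MEM_FENCE);\n"
    "    /* Steps 2 and 3. The bounds check sits after the barrier so that\n"
    "       every work-item in the group reaches the barrier. */\n"
    "    if (col < n)\n"
    "        a[row * n + col] *= 1.0f / pivot;\n"
    "}\n";

/* Serial reference with the same read-before-write order, useful for
 * checking the values the kernel is expected to produce. */
void scale_row_reference(float *a, int n, int row)
{
    assert(n > 0 && row >= 0 && row < n);
    const float inv = 1.0f / a[row * n + row]; /* read the pivot once, first */
    for (int col = 0; col < n; ++col)
        a[row * n + col] *= inv;
}
```

For rows wider than the device's maximum work-group size this sketch does not apply, and splitting the work into two kernels is the safer route.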

The other way is to separate the work into two kernels; you can pass the cl_mem computed by the step 1 kernel directly to the step 3 kernel. This won't cause any CPU/GPU I/O.
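A sketch of the two-kernel split, under the same assumption of a row-major n x n float matrix `a` in global memory. The kernel names `pivot_inverse` and `scale_row_by_inverse`, the one-element buffer `inv`, and the serial helper functions are hypothetical and only illustrate the idea. On an in-order command queue the second kernel starts only after the first has finished, so every work-item sees the stored inverse regardless of how many work-groups are used.

```c
#include <assert.h>

/* OpenCL C source for both kernels, as it would be passed to
 * clCreateProgramWithSource. All names are illustrative. */
const char *two_kernel_src =
    "/* Kernel 1 (steps 1 and 2): one work-item stores the inverse pivot. */\n"
    "__kernel void pivot_inverse(__global const float *a,\n"
    "                            __global float *inv,\n"
    "                            const int n, const int row)\n"
    "{\n"
    "    if (get_global_id(0) == 0)\n"
    "        inv[0] = 1.0f / a[row * n + row];\n"
    "}\n"
    "\n"
    "/* Kernel 2 (step 3): each work-item scales one element of the row. */\n"
    "__kernel void scale_row_by_inverse(__global float *a,\n"
    "                                   __global const float *inv,\n"
    "                                   const int n, const int row)\n"
    "{\n"
    "    const int col = get_global_id(0);\n"
    "    if (col < n)\n"
    "        a[row * n + col] *= inv[0];\n"
    "}\n";

/* Serial reference for kernel 1. */
void pivot_inverse_reference(const float *a, float *inv, int n, int row)
{
    assert(n > 0 && row >= 0 && row < n);
    inv[0] = 1.0f / a[row * n + row];
}

/* Serial reference for kernel 2. */
void scale_row_by_inverse_reference(float *a, const float *inv, int n, int row)
{
    assert(n > 0 && row >= 0 && row < n);
    for (int col = 0; col < n; ++col)
        a[row * n + col] *= inv[0];
}
```

On the host, the same cl_mem object backing `inv` would be set as an argument of both kernels, so the inverse stays on the device between the two launches.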

Multiplying a diagonal matrix by a dense matrix is a set of dot products, which can be computed with a reduction. That will make your function faster.
