I'm learning OpenCL by porting a few existing CUDA functions. Below are my CUDA and OpenCL kernels. When the same input parameters are passed to both functions, the outputs differ on the order of 10^-3 to 10^-4. As I repeatedly invoke these functions, the difference grows significantly (which is very bad for my expected output). Is there anything wrong with my OpenCL port?
Note: I've already tried passing "-cl-opt-disable" when compiling the OpenCL kernel.
CUDA Kernel:
__global__ void normalize_kernel(int N, float *x, float *mean, float *variance, int batch, int filters, int spatial)
{
    int index = (blockIdx.x + blockIdx.y*gridDim.x) * blockDim.x + threadIdx.x;
    if (index >= N) return;
    int f = (index/spatial)%filters;
    x[index] = (x[index] - mean[f])/(sqrt(variance[f] + .00001f));
}
OpenCL Kernel:
__kernel void normalize_kernel(int N, __global float *x, __global float *mean, __global float *variance, int filters, int spatial)
{
    int index = get_group_id(1) * get_global_size(0) + get_global_id(0);
    if (index >= N) return;
    int f = (index/spatial)%filters;
    x[index] = (x[index] - mean[f])/(sqrt(variance[f] + .00001f));
}
Output: CUDA : OpenCL
{'1.293604': '1.293387',
'0.727771': '0.727677',
'0.868133': '0.867531',
'2.195427': '2.195059'...
[...] sqrt function - which usually isn't) you can't expect 100% identical output. If this level of precision is not acceptable to you, you might want to switch to using doubles on both platforms. (Don't forget to enable 64-bit double support.) - pmdj
[...] sqrt are probably your main culprits, especially if variance values are small. Have you tried replacing the calculation with multiplication by rsqrt()? That might reduce the divergence. (And may increase performance, unless the compiler has already applied that optimisation.) But if you care about precision, you'll need to use true doubles to get a bigger safety cushion. If you want 100% predictability, you'll need to use integer/fixed point maths. - pmdj
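One way to see which side is drifting is to run the same formula on the CPU in double precision and compare each device's output against it. The following is only a sketch under assumptions: the helper names (normalize_ref_f32, normalize_ref_f64, max_abs_diff) are hypothetical and not from either codebase, and it assumes the buffers have been copied back to the host as plain float arrays.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: CPU version of the kernel body in single precision.
// Updates x in place, like the CUDA and OpenCL kernels in the question.
void normalize_ref_f32(int N, float *x, const float *mean,
                       const float *variance, int filters, int spatial)
{
    assert(filters > 0 && spatial > 0);
    for (int index = 0; index < N; ++index) {
        int f = (index / spatial) % filters;
        x[index] = (x[index] - mean[f]) / std::sqrt(variance[f] + .00001f);
    }
}

// Hypothetical helper: the same formula evaluated in double precision.
// Reads the float inputs and writes the result to out, leaving x untouched.
void normalize_ref_f64(int N, const float *x, const float *mean,
                       const float *variance, int filters, int spatial,
                       double *out)
{
    assert(filters > 0 && spatial > 0);
    const double eps = (double).00001f;  // same epsilon value the kernels use
    for (int index = 0; index < N; ++index) {
        int f = (index / spatial) % filters;
        out[index] = ((double)x[index] - (double)mean[f])
                   / std::sqrt((double)variance[f] + eps);
    }
}

// Largest absolute difference between a float result and the double reference.
double max_abs_diff(int N, const float *got, const double *ref)
{
    double worst = 0.0;
    for (int i = 0; i < N; ++i) {
        double d = std::fabs((double)got[i] - ref[i]);
        if (d > worst) worst = d;
    }
    return worst;
}
```

To use it, compute the double reference from the original inputs first, then call max_abs_diff once with the CUDA output and once with the OpenCL output. If one of them is much further from the reference than normalize_ref_f32 is, that side probably deserves a closer look (build options, the sqrt variant in use, or the mean and variance values being fed in).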