CUDA Vs OpenCL: Floating point precision change

Question

I'm learning OpenCL by porting a few existing CUDA functions. Below are my CUDA and OpenCL kernels. When the same input parameters are passed to both functions, the outputs differ by an order of 10^-3 to 10^-4. As I repeatedly invoke these functions, the difference grows significantly (which is very bad for my expected output). Is there anything wrong in my OpenCL port?

Note: I've already tried "-cl-opt-disable" when compiling the OpenCL kernel.

CUDA Kernel:

__global__ void normalize_kernel(int N, float *x, float *mean, float *variance, int batch, int filters, int spatial)
{
    int index = (blockIdx.x + blockIdx.y*gridDim.x) * blockDim.x + threadIdx.x;
    if (index >= N) return;
    int f = (index/spatial)%filters;

    x[index] = (x[index] - mean[f])/(sqrt(variance[f] + .00001f));
}

OpenCL Kernel:

__kernel void normalize_kernel(int N, __global float *x, __global float *mean, __global float *variance, int filters, int spatial)
{
    int index =  get_group_id(1) * get_global_size(0) + get_global_id(0);
    if (index >= N) return;
    int f = (index/spatial)%filters;

    x[index] = (x[index] - mean[f])/(sqrt(variance[f] + .00001f));
}

Output (CUDA : OpenCL):

{'1.293604': '1.293387',
 '0.727771': '0.727677',
 '0.868133': '0.867531',
 '2.195427': '2.195059'...

What makes you think the CUDA output is "more correct" than OpenCL's? Unless both implementations are fully IEEE754 compliant (including the sqrt function, which usually isn't), you can't expect 100% identical output. If this level of precision is not acceptable to you, you might want to switch to using doubles on both platforms. (Don't forget to enable 64-bit double support.) - pmdj
@pmdj I don't assume the CUDA output is correct. The OpenCL output also differs from the CPU output. I would like to reproduce either the CPU output or the CUDA output using OpenCL. - Avis
@pmdj A precision difference of the order of 10^-5 or 10^-6 is acceptable, but 10^-3 will affect the output very much. - Avis
GPUs typically don't have IEEE754-compliant FPUs, so you won't get consistent behaviour. I suspect the division and sqrt are probably your main culprits, especially if variance values are small. Have you tried replacing the calculation with multiplication by rsqrt()? That might reduce the divergence. (And it may increase performance, unless the compiler has already applied that optimisation.) But if you care about precision, you'll need to use true doubles to get a bigger safety cushion. If you want 100% predictability, you'll need to use integer/fixed-point maths. - pmdj
This website gives an example implementation for fixed-point maths. I seem to remember an old "Graphics Gems" book having a fixed-point implementation of square root and possibly inverse square root. Only you know your data though, so I can't say whether fixed-point maths or double-precision floats would be more appropriate for your use case. - pmdj

Tim · Accepted Answer · 2017-07-25T04:34:15

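Check out the -cl-fp32-correctly-rounded-divide-sqrt build option (OpenCL 1.2 or higher), which is passed in the options string of clBuildProgram or clCompileProgram. It requests correctly rounded single-precision divide and sqrt. On some hardware this will slow performance a little, but it may not be that big a deal.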

See https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clCompileProgram.html
