3
votes

I am trying to parallelize a Monte Carlo simulation using OpenCL, with MWC64X as the uniform random number generator. The code runs well on different Intel CPUs, in that the output of the parallel computation is very close to the sequential one.

Using OpenCL device: Intel(R) Xeon(R) CPU E5-2630L v3 @ 1.80GHz
Literal influence running time: 0.029048 seconds        r1 seqInfl= 0.4771
Literal influence running time: 0.029762 seconds        r2 seqInfl= 0.4771
Literal influence running time: 0.029742 seconds        r3 seqInfl= 0.4771
Literal influence running time: 0.02971 seconds         ra seqInfl= 0.4771
Literal influence running time: 0.029225 seconds        trust1-57 seqInfl= 0.6001
Literal influence running time: 0.04992 seconds         trust110-1 seqInfl= 0
Literal influence running time: 0.034636 seconds        trust4-57 seqInfl= 0
Literal influence running time: 0.049079 seconds        trust57-110 seqInfl= 0
Literal influence running time: 0.024442 seconds        trust57-4 seqInfl= 0.8026
Literal influence running time: 0.04946 seconds         trust33-1 seqInfl= 0
Literal influence running time: 0.049071 seconds        trust57-33 seqInfl= 0
Literal influence running time: 0.053117 seconds        trust4-1 seqInfl= 0.1208
Literal influence running time: 0.051642 seconds        trust57-1 seqInfl= 0
Literal influence running time: 0.052052 seconds        trust57-64 seqInfl= 0
Literal influence running time: 0.052118 seconds        trust64-1 seqInfl= 0
Literal influence running time: 0.051998 seconds        trust57-7 seqInfl= 0
Literal influence running time: 0.052069 seconds        trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.71728 seconds
Sequential maxInfluence Literal: trust57-4 0.8026

index1= 17 size= 51 dim1_size= 6
sum0:4781   influence0:0.478100 sum2:4781   influence2:0.478100 sum6:0  influence6:0.000000 sum10:0 sum12:0 influence12:0.000000    sum7:0  influence7:0.000000 influence10:0.000000    sum4:5962   influence4:0.596200 sum8:7971   influence8:0.797100 sum1:4781   influence1:0.478100 sum3:4781   influence3:0.478100 sum13:0 influence13:0.000000    sum11:1261  influence11:0.126100    sum9:0  influence9:0.000000 sum14:0 influence14:0.000000    sum5:0  influence5:0.000000 sum15:0 influence15:0.000000    sum16:0 influence16:0.000000    
Parallel influence running time: 0.054391 seconds
Parallel maxInfluence Literal: trust57-4 Infl=0.7971

However, when I run the code on a GeForce GTX 1080 Ti (driver 430.40, CUDA 10.1, OpenCL 1.2 CUDA installed), the output is as below:

Using OpenCL device: GeForce GTX 1080 Ti
Influence:
Literal influence running time: 0.011119 seconds        r1 seqInfl= 0.4771
Literal influence running time: 0.011238 seconds        r2 seqInfl= 0.4771
Literal influence running time: 0.011408 seconds        r3 seqInfl= 0.4771
Literal influence running time: 0.01109 seconds         ra seqInfl= 0.4771
Literal influence running time: 0.011132 seconds        trust1-57 seqInfl= 0.6001
Literal influence running time: 0.018978 seconds        trust110-1 seqInfl= 0
Literal influence running time: 0.013093 seconds        trust4-57 seqInfl= 0
Literal influence running time: 0.018968 seconds        trust57-110 seqInfl= 0
Literal influence running time: 0.009105 seconds        trust57-4 seqInfl= 0.8026
Literal influence running time: 0.018753 seconds        trust33-1 seqInfl= 0
Literal influence running time: 0.018583 seconds        trust57-33 seqInfl= 0
Literal influence running time: 0.02005 seconds         trust4-1 seqInfl= 0.1208
Literal influence running time: 0.01957 seconds         trust57-1 seqInfl= 0
Literal influence running time: 0.019686 seconds        trust57-64 seqInfl= 0
Literal influence running time: 0.019632 seconds        trust64-1 seqInfl= 0
Literal influence running time: 0.019687 seconds        trust57-7 seqInfl= 0
Literal influence running time: 0.019859 seconds        trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.272032 seconds
Sequential maxInfluence Literal: trust57-4 0.8026

index1= 17 size= 51 dim1_size= 6
sum0:10000  sum1:10000  sum2:10000  sum3:10000  sum4:10000  sum5:0  sum6:0  sum7:0  sum8:10000  sum9:0  sum10:0 sum11:0 sum12:0 sum13:0 sum14:0 sum15:0 sum16:0 
Parallel influence running time: 0.193581 seconds

The "Influence" value equals sum*1.0/10000, so the parallel influence values consist only of 1s and 0s. This is incorrect in the GPU runs, and it does not happen when parallelizing on an Intel CPU.

When I check the output of the random number generator with if(flag==0) printf("randint=%u",randint);, it seems the outputs are all zero on the GPU. Below are the clinfo output and the .cl code:

 Device Name                                     GeForce GTX 1080 Ti
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 1.2 CUDA
  Driver Version                                  430.40
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Topology (NV)                            PCI-E, 68:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               28
  Max clock frequency                             1721MHz
  Compute Capability (NV)                         6.1
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x64
  Max work group size                             1024
  Preferred work group size multiple              32
  Warp size (NV)                                  32
  Preferred / native vector sizes                 
    char                                                 1 / 1       
    short                                                1 / 1       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 0 / 0        (n/a)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              11720130560 (10.92GiB)
  Error Correction support                        No
  Max memory allocation                           2930032640 (2.729GiB)
  Unified memory for Host and Device              No
  Integrated memory (NV)                          No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       4096 bits (512 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        458752 (448KiB)
  Global Memory cache line size                   128 bytes
  Image support                                   Yes
    Max number of samplers per kernel             32
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             16384x32768 pixels
    Max 3D image size                             16384x16384x16384 pixels
    Max number of read image args                 256
    Max number of write image args                16
  Local memory type                               Local
  Local memory size                               49152 (48KiB)
  Registers per block (NV)                        65536
  Max number of constant args                     9
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     4352 (4.25KiB)
  Queue properties                                
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Kernel execution timeout (NV)                 Yes
  Concurrent copy and kernel execution (NV)       Yes
    Number of async copy engines                  2
  printf() buffer size                            1048576 (1024KiB)
#define N 70 // N > index, which is the total number of literals
#define BASE 4294967296UL

//! Represents the state of a particular generator
typedef struct{ uint x; uint c; } mwc64x_state_t;
enum{ MWC64X_A = 4294883355U };
enum{ MWC64X_M = 18446383549859758079UL };

void MWC64X_Step(mwc64x_state_t *s)
{
    uint X=s->x, C=s->c;

    uint Xn=MWC64X_A*X+C;
    uint carry=(uint)(Xn<C);                // The (Xn<C) will be zero or one for scalar
    uint Cn=mad_hi(MWC64X_A,X,carry);  

    s->x=Xn;
    s->c=Cn;
}

//! Return a 32-bit integer in the range [0..2^32)
uint MWC64X_NextUint(mwc64x_state_t *s)
{
    uint res=s->x ^ s->c;
    MWC64X_Step(s);
    return res;
}


__kernel void setInfluence(const int literals, const int size, const int dim1_size, __global int* lambdas, __global float* lambdap, __global int* dim2_size, __global float* influence){   
    int flag=get_global_id(0);
    int sum=0;
    int count=10000;
    int assignment[N];
    //or try to get newlambda like original version does
    if(flag < literals){
        mwc64x_state_t rng;
        for(int i=0; i<count; i++){
            for(int j=0; j<size; j++){
                uint randint=MWC64X_NextUint(&rng);
                float rand=randint*1.0/BASE;
                //if(flag==0)
                //  printf("randint=%u",randint);
                if(lambdap[j]<rand)
                    assignment[lambdas[j]]=0;
                else
                    assignment[lambdas[j]]=1;               
            }
            //the true case
            assignment[flag]=1;
            int valuet=0;
            int index=0;
            for(int m=0; m<dim1_size; m++){
                int valueMono=1;
                for(int n=0; n<dim2_size[m]; n++){
                    if(assignment[lambdas[index+n]]==0){
                        valueMono=0;
                        index+=dim2_size[m];
                        break;
                    }
                }
                if(valueMono==1){
                    valuet=1;
                    break;
                }
            }        
            //the false case
            assignment[flag]=0;
            int valuef=0;
            index=0;
            for(int m=0; m<dim1_size; m++){
                int valueMono=1;
                for(int n=0; n<dim2_size[m]; n++){
                    if(assignment[lambdas[index+n]]==0){
                        valueMono=0;
                        index+=dim2_size[m];
                        break;
                    }
                }
                if(valueMono==1){
                    valuef=1;
                    break;
                }
            }
            sum += valuet-valuef;            
        }
        influence[flag] = 1.0*sum/count;
        printf("sum%d:%d\t", flag, sum);
    }
}  

What might be the problem when running the code on a GPU? Is it MWC64X? According to its author, it performs well on NVIDIA GPUs. If so, how can I fix it; if not, what might the problem be?

2
You're not initialising your mwc64x_state_t rng variable, so any results will be undefined. Note that you will probably want to seed your RNG differently for each work-item, otherwise you will get nasty correlation artifacts in your results. – pmdj
@pmdj Thank you. After initialization, the results are correct on both CPU and GPU. – Chenyuan Wu

2 Answers

1
votes

(This started out as a comment, it turns out this was the source of the problem so I'm turning it into an answer.)

You're not initialising your mwc64x_state_t rng; variable before reading from it, so any results will be undefined:

    mwc64x_state_t rng;
    for(int i=0; i<count; i++){
        for(int j=0; j<size; j++){
            uint randint=MWC64X_NextUint(&rng);

Where MWC64X_NextUint() immediately reads from the rng state before updating it:

uint MWC64X_NextUint(mwc64x_state_t *s)
{
    uint res=s->x ^ s->c;

Note that you will probably want to seed your RNG differently for each work-item, otherwise you will get nasty correlation artifacts in your results.

0
votes

All use-cases of pseudo-random numbers are a next-level challenge on truly [PARALLEL] computing platforms (a property of the platform, not of the language).

Either,
there is some hardware source-of-randomness, which gets us into trouble once massively-parallel requests must be fairly handled in a truly [PARALLEL] fashion (here, hardware resources may help, yet at the cost of not being able to reproduce the same behaviour "outside" of this very platform and moment-in-time, since such a source has no seed-injection feature by which a "just"-pseudo-random algorithm could replay the same pure-[SERIAL] sequence of numbers),

Or,
there is some "shared" generator of pseudo-random numbers, which enjoys a higher system-wide level-of-entropy (good for the resulting "quality" of pseudo-randomness) but at the cost of a purely serial dependence (no parallel execution possible; requests get served one after another in a sequential manner) and close to zero chance of repeatable runs (a must for reproducible science) delivering the same sequences, as needed for testing and for method-validation cases.


RESUME :

The code may employ a work-item-"private" pseudo-random generating function (privacy is a must, both for parallel code-execution and for the mutual independence, i.e. non-interference, of the generated streams), yet each instance must be a) independently initialised, so as to provide the expected level of randomness achievable in parallelised code-runs, and b) initialised in a repeatably reproducible manner, so the test can be re-run at different times, often on different OpenCL target computing-platforms.

For __kernel-s that do not rely on hardware-specific sources-of-randomness, meeting conditions a) and b) will suffice for receiving repeatably reproducible (same) results for testing "in vitro", while still providing a reasonably random method for generating results during generic production-level code-runs "in vivo".


The comparison of net-run-times (benchmarked above) seems to show that Amdahl's-law add-on overhead costs, plus a tail-end effect of the atomicity-of-work, finally decided the outcome: the net-run-time was ~ 3.6x faster on the XEON than on the GPU:

index1    = 17
size      = 51
dim1_size =  6
sum0:  4781   influence0:  0.478100
sum2:  4781   influence2:  0.478100
sum6:     0   influence6:  0.000000
sum10:    0   influence10: 0.000000
sum12:    0   influence12: 0.000000
sum7:     0   influence7:  0.000000
sum4:  5962   influence4:  0.596200
sum8:  7971   influence8:  0.797100
sum1:  4781   influence1:  0.478100
sum3:  4781   influence3:  0.478100
sum13:    0   influence13: 0.000000
sum11: 1261   influence11: 0.126100
sum9:     0   influence9:  0.000000
sum14:    0   influence14: 0.000000
sum5:     0   influence5:  0.000000
sum15:    0   influence15: 0.000000
sum16:    0   influence16: 0.000000
     Parallel influence running time: 0.054391 seconds on XEON E5-2630L v3 @ 1.80GHz using OpenCL
                                         |....
index1    = 17                           |....
size      = 51                           |....
dim1_size =  6                           |....
sum0: 10000                              |....
sum1: 10000                              |....
sum2: 10000                              |....
sum3: 10000                              |....
sum4: 10000                              |....
sum5:     0                              |....
sum6:     0                              |....
sum7:     0                              |....
sum8: 10000                              |....
sum9:     0                              |....
sum10:    0                              |....
sum11:    0                              |....
sum12:    0                              |....
sum13:    0                              |....
sum14:    0                              |....
sum15:    0                              |....
sum16:    0                              |....
     Parallel influence running time: 0.193581 seconds on GeForce GTX 1080 Ti using OpenCL