Endianness Of OpenCL devices

Question

Apart from precision / accuracy / alignment inconsistencies between a gpu, a cpu and a customized opencl chip, why all opencl examples over internet directly accept host-side buffers?

clEnqueueWriteBuffer (  cl_command_queue command, ...)

__kernel void vecAdd(__global float * c, ..)
{
   int id=get_global_id(0);
   c[id]=b[id]+a[id]
}

lets assume host is little endian but device is big endian and a,b,c, are floats:

Would b and a be loaded as non-normalized garbage?

if yes, do I have to send a single byte telling kernel about buffer endianness?
if no, are they converted(using 4 byte SIMD shuffling of CPU? Or at GPU?) during buffer read-write operations automatically? Even for USE_HOST_PTR(RAM) type buffers and CL_MEM_READ_WRITE(GDDR)?

If unknown, would below example functions work always?

    int endianness_of_device()
    {
            unsigned int x = 1;
            return ((  (  (char *)&x) [0])==0?0:1) ;
    }

    int reverseBytesInt(int bytes)
    {
        void * p = &bytes;
        uchar4 bytesr = ((uchar4 *)p)[0].wzyx;
        int * p2=(int *)&bytesr;
        return *p2;
    }

    float reverseBytesFloat(float bytes)
    {
        void * p = &bytes;
        uchar4 bytesr = ((uchar4 *)p)[0].wzyx;
        float * p2=(float *)&bytesr;
        return *p2;
    }

    uint sizeOf2(uint structSize)
    {
        uit mult0=structSize/256;mult0++;
        return mult0*256;
    }

    typedef struct
    {
        uchar end;
        uchar sizeRelByteAdr;
        uchar adrRelByteAdr;
        uchar idRelByteAdr;
        uchar valRelByteAdr;
        uchar woRelByteAdr;
        int size;
        uint adr; 
        int id;
        float val;
        float wo;
    }nn_p_n;




    uint malloc(__global uchar * heap, __global uint * mallocCtr, int size)
    {
        return (atomic_add(mallocCtr,(uint)size)+(uint)heap);
    }

to help kernels like:

           __kernel void nn( __global uchar *heap,__global uint * mallocCtr)
                        {
                            int id=get_global_id(0);
                            if(id==0)
                            {

                                nn_p_n * np=(nn_p_n *)malloc(heap,mallocCtr,sizeOf2(sizeof(nn_p_n)));
                                np->end=endianness_of_device();
                                np->size=sizeOf2(sizeof(nn_p_n));
                                np->id=9;
                                np->val=99.9f;

                                // lets simulate different endianness
                                np->end=1-endianness_of_device();


                                np->adr=(uint)np-(uint)heap;

                                mem_fence(CLK_GLOBAL_MEM_FENCE);
                            }

                            if(id==900)
                            {
                                // simulating another device reads buffer
                                for(int i=0;i<1000000;i++){int dummy=0; dummy++; if(dummy>id) dummy++;}

                                nn_p_n n=*((nn_p_n *)&heap[0]);
                                if(n.end!=endianness_of_device())
                                {
                                    n.size=reverseBytesInt(n.size);
                                    //if(n.size!=sizeof2(sizeof(nn_p_n)))
                                    //{  return; }
                                    n.adr=reverseBytesInt(n.adr);
                                    n.val=reverseBytesFloat(n.val);
                                }
                                nn_p_n * np=(nn_p_n *)malloc(heap,mallocCtr,sizeOf2(sizeof(nn_p_n)));
                                *np = n;
                            }

                        }

because it is working for my Intel igpu with an Intel CPU at the moment and other machine with Amd CPU and Amd GPU without any endianness problems. What if I get Nvidia gpu and Amd gpu on top of Intel CPU in future?

Ofcourse in cluster computing, one needs to cover endianness cases between computers but what about a virtual os running in another os and using same device? What about that device is an fpga having multiple endianness cores(possible?)?

Last question: can an OS force all devices, even the CPU, to become same endianness?(I don't think so but could be emulated by OS at a cost of performance?)

Edit: It's impossible to preprocess a read-only data on device-side. It is overkill to postprocess on host side those elements which are marked as "written" because only 1-2 elements could have been written while whole data could be gigabytes to download to host.

According to @Gundolf Gundelfinger, there is no perfect code, there is perfect hardware setup. — huseyin tugrul buyukisik

user703016 user703016 · Accepted Answer · 2016-11-17T07:14:43

Arguments passed to kernels are guaranteed to have the correct endianness (hence all the typedefs cl_int, etc), but this is not the case for buffers. This makes sense, because the contents of buffers are completely opaque for OpenCL: only the user knows how to make sense of what's inside. Hence, it's the user's responsibility to perform potential endianness conversion before doing computations (possibly by launching a dedicated kernel).

In other words:

__kernel void vecAdd(__global float * c, ..)

Here, the value of c is guaranteed to be of the correct endianness (the pointer's bytes themselves are in the correct device order), but the bytes pointed to by c are in whatever order the user set them on the host.

why all opencl examples over internet directly accept host-side buffers?

Most programs are developed with a fairly narrow set of target platforms, where characteristics are known in advance: if they're all little-endian, why bother supporting big-endian? Portability is already a difficult problem, and I suspect that caring about endianness is, in the general case, a significant additional complexity for little added value. It simply isn't worth it for the vast majority of programs.

If you consider such a level of portability valuable, then it's your choice to implement it.

Endianness Of OpenCL devices

1 Answers