
Apart from precision / accuracy / alignment inconsistencies between a GPU, a CPU and a custom OpenCL chip, why do all OpenCL examples on the internet directly accept host-side buffers?

clEnqueueWriteBuffer(cl_command_queue command_queue, ...)

__kernel void vecAdd(__global float * c, __global float * b, __global float * a)
{
   int id=get_global_id(0);
   c[id]=b[id]+a[id];
}
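
For reference, a typical host-side upload looks like the sketch below (context, queue, err, bufA, n and the float array a are placeholder names), which simply hands the runtime a pointer to host memory:

    /* Hypothetical host-side setup: the bytes of 'a' are handed over as-is. */
    cl_mem bufA = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                 n * sizeof(float), NULL, &err);
    clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0,
                         n * sizeof(float), a, 0, NULL, NULL);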

Let's assume the host is little-endian but the device is big-endian, and a, b, c are floats:

  • Would b and a be loaded as non-normalized garbage?

    • If yes, do I have to send a single byte telling the kernel about the buffer's endianness?
    • If no, are they converted automatically during buffer read/write operations (using 4-byte SIMD shuffling on the CPU, or on the GPU)? Even for CL_MEM_USE_HOST_PTR (RAM) buffers and CL_MEM_READ_WRITE (GDDR) ones?
    • If unknown, would the example functions below always work? (A variant using OpenCL C's as_typen builtins is sketched after the Edit at the end of this question.)

          // Returns 0 if the device is big-endian, 1 if it is little-endian.
          int endianness_of_device()
          {
                  unsigned int x = 1;
                  return ((((char *)&x)[0]) == 0 ? 0 : 1);
          }
      
          // Reverse the 4 bytes of an int with a uchar4 swizzle.
          int reverseBytesInt(int bytes)
          {
              void * p = &bytes;
              uchar4 bytesr = ((uchar4 *)p)[0].wzyx;
              int * p2 = (int *)&bytesr;
              return *p2;
          }

          // Reverse the 4 bytes of a float the same way.
          float reverseBytesFloat(float bytes)
          {
              void * p = &bytes;
              uchar4 bytesr = ((uchar4 *)p)[0].wzyx;
              float * p2 = (float *)&bytesr;
              return *p2;
          }
      
          // Round a size up to the next multiple of 256 bytes
          // (always adds at least 256).
          uint sizeOf2(uint structSize)
          {
              uint mult0 = structSize / 256; mult0++;
              return mult0 * 256;
          }
      
          // Self-describing record: 'end' carries the writer's endianness
          // so a reader can decide whether to byte-swap the other fields.
          typedef struct
          {
              uchar end;
              uchar sizeRelByteAdr;
              uchar adrRelByteAdr;
              uchar idRelByteAdr;
              uchar valRelByteAdr;
              uchar woRelByteAdr;
              int size;
              uint adr;
              int id;
              float val;
              float wo;
          } nn_p_n;
      
      
      
      
          // Bump allocator over a device-side heap: atomically advance the
          // counter and return heap base + old offset as a numeric address.
          // Note: casting a __global pointer to uint assumes 32-bit addressing.
          uint malloc(__global uchar * heap, __global uint * mallocCtr, int size)
          {
              return (atomic_add(mallocCtr,(uint)size)+(uint)heap);
          }
      

to support kernels like:

    __kernel void nn(__global uchar * heap, __global uint * mallocCtr)
    {
        int id = get_global_id(0);
        if (id == 0)
        {
            // Writer: allocate a record and fill it.
            nn_p_n * np = (nn_p_n *)malloc(heap, mallocCtr, sizeOf2(sizeof(nn_p_n)));
            np->end = endianness_of_device();
            np->size = sizeOf2(sizeof(nn_p_n));
            np->id = 9;
            np->val = 99.9f;

            // Let's simulate a different endianness.
            np->end = 1 - endianness_of_device();

            np->adr = (uint)np - (uint)heap;

            // Note: mem_fence only orders this work-item's memory accesses;
            // it is not a synchronization barrier across work-groups.
            mem_fence(CLK_GLOBAL_MEM_FENCE);
        }

        if (id == 900)
        {
            // Simulating another device reading the buffer (busy-wait delay).
            for (int i = 0; i < 1000000; i++) { int dummy = 0; dummy++; if (dummy > id) dummy++; }

            nn_p_n n = *((nn_p_n *)&heap[0]);
            if (n.end != endianness_of_device())
            {
                n.size = reverseBytesInt(n.size);
                //if(n.size != sizeOf2(sizeof(nn_p_n)))
                //{  return; }
                n.adr = reverseBytesInt(n.adr);
                n.val = reverseBytesFloat(n.val);
            }
            nn_p_n * np = (nn_p_n *)malloc(heap, mallocCtr, sizeOf2(sizeof(nn_p_n)));
            *np = n;
        }
    }

because it is working at the moment on my Intel iGPU + Intel CPU machine, and on another machine with an AMD CPU and an AMD GPU, without any endianness problems. What if I add an Nvidia GPU and an AMD GPU on top of an Intel CPU in the future?

Of course, in cluster computing one needs to handle endianness differences between computers, but what about a virtual OS running inside another OS and using the same device? What about that device being an FPGA with cores of different endiannesses (is that even possible?)?

Last question: can an OS force all devices, even the CPU, to use the same endianness? (I don't think so, but could it be emulated by the OS at a performance cost?)

Edit: It's impossible to preprocess read-only data on the device side. And postprocessing on the host side only the elements marked as "written" is overkill, because only 1-2 elements may have been written while whole gigabytes of data would have to be downloaded to the host.
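
As a side note on the helper functions above: OpenCL C's as_typen reinterpret builtins can express the same 4-byte reversal without pointer casts. A sketch that should be behaviorally equivalent to reverseBytesInt / reverseBytesFloat:

    // Same byte reversal via reinterpret builtins (no pointer aliasing).
    int reverseBytesInt2(int v)       { return as_int(as_uchar4(v).wzyx); }
    float reverseBytesFloat2(float v) { return as_float(as_uchar4(v).wzyx); }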

According to @Gundolf Gundelfinger, there is no perfect code, there is only a perfect hardware setup. – huseyin tugrul buyukisik

1 Answer


Arguments passed to kernels are guaranteed to have the correct endianness (hence all the typedefs cl_int, etc.), but this is not the case for buffers. This makes sense, because the contents of buffers are completely opaque to OpenCL: only the user knows how to interpret what's inside. Hence, it is the user's responsibility to perform any endianness conversion before doing computations, possibly by launching a dedicated kernel (a sketch of one is given at the end of this answer).

In other words:

__kernel void vecAdd(__global float * c, ..)

Here, the value of c itself is guaranteed to have the correct endianness (the pointer's bytes are in the correct device order), but the bytes pointed to by c are in whatever order the user set them on the host.
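
If you need to know whether a conversion is required at all, the device's byte order can be queried from the host; a minimal sketch, assuming device is a valid cl_device_id:

    /* CL_TRUE if the device is little-endian, CL_FALSE if big-endian. */
    cl_bool littleEndian = CL_FALSE;
    clGetDeviceInfo(device, CL_DEVICE_ENDIAN_LITTLE,
                    sizeof(littleEndian), &littleEndian, NULL);

Comparing this with the host's own byte order tells you whether buffers need swapping before or after transfer.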

why all opencl examples over internet directly accept host-side buffers?

Most programs are developed for a fairly narrow set of target platforms whose characteristics are known in advance: if they are all little-endian, why bother supporting big-endian? Portability is already a difficult problem, and caring about endianness is, in the general case, significant additional complexity for little added value. It simply isn't worth it for the vast majority of programs.

If you consider such a level of portability valuable, then it's your choice to implement it.
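
Such a dedicated conversion kernel can be very small; here is a sketch (kernel name and launch configuration are illustrative) that byte-swaps a buffer of 32-bit words in place, one work-item per word:

    // Hypothetical in-place byte-swap pass for a buffer of 32-bit values.
    __kernel void swapEndian32(__global uint * data)
    {
        int id = get_global_id(0);
        data[id] = as_uint(as_uchar4(data[id]).wzyx);
    }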