I'm moving a simulation into pyOpenCL and can't get my data access to work. I'm trying to supply a 1D array of vectors (actually several arrays, but the example below uses just one).

Currently, several vectors are copied over just fine, but then the data is simply not what I supplied.

I don't think I've posted here before, so apologies if any of the formatting/presentation is wrong. Also, I've stripped out all the simulation code, so I realise this code currently doesn't do anything; I just want to get the buffer passing correct.

Thanks in advance.

The kernel (kertest.py):

step1 = """
#pragma OPENCL EXTENSION cl_amd_printf: enable
#define X xdim
#define Y ydim
__kernel void k1(__global float3 *spins,
                 __local float3 *tile)
{        
    ushort lid = 2 * get_local_id(0);
    ushort group = 2 * get_group_id(0);
    ushort num = get_num_groups(0);
    int lim = X*Y*3;

    for (ushort i = 0; i < lim; i++)
        {
            if (lid == 0 && group == 0)
            {
                printf("%f :: %d\\n", spins[i].x, i);
            }
         }
}"""

The code itself (gputest.py):

import kertest as k2D
import numpy as np
import pyopencl as cl

class GPU_MC2DSim():
    def __init__(self, x, y):
        self.x = x
        self.y = y

        if x >= y:
            self.xdim = int(self.x)
            self.ydim = int(self.y)
        else:
            self.xdim = int(self.y)
            self.ydim = int(self.x)

        if self.xdim % 2 != 0: self.xdim += 1

        if self.ydim % 2 != 0: self.ydim += 1

        self.M = np.ones((self.xdim*self.ydim, 3)).astype(np.float32)
        self.M[:, 1] += 1.0
        self.M[:, 2] += 2.0

        print self.M

    def simulate(self):
        ctx = cl.create_some_context()
        q = cl.CommandQueue(ctx)
        mf = cl.mem_flags

        #Pass buffer:
        M_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf = self.M)

        #Insert kernel parameters:
        params = {"xdim" : "%d" % (self.xdim),
                  "ydim" : "%d" % (self.ydim),
                  }
        for name in params:
            k2D.step1 = k2D.step1.replace(name, params[name])

        #Compile kernel:
        step1 = cl.Program(ctx, k2D.step1).build()

        locmem = cl.LocalMemory(self.xdim*4*4)

        step1.k1(q, ((self.xdim*self.ydim)/4,), (self.xdim/2,), M_buf, locmem).wait()
        return None

xdim = 4
ydim = 4
sim = GPU_MC2DSim(xdim, ydim)
sim.simulate()

1 Answer

Your code for copying the data to the device is just fine. However, your kernel has at least two problems:

  1. float3 values are expected to be 16-byte aligned, as per OpenCL 1.2 Spec, 6.1.5:

    For 3-component vector data types, the size of the data type is 4 * sizeof(component). This means that a 3-component vector data type will be aligned to a 4 * sizeof(component) boundary. The vload3 and vstore3 built-in functions can be used to read and write, respectively, 3-component vector data types from an array of packed scalar data type.

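    The values you upload to the device are tightly packed: NumPy stores each (x, y, z) row in 12 bytes, while the kernel steps through the buffer 16 bytes at a time when it indexes float3 values directly. The data is therefore not properly aligned for the kernel to read it as float3.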

  2. Your limit calculation int lim = X*Y*3; is too large by a factor of 3. spins is already indexed in units of float3, so the *3 is superfluous and the loop reads past the end of the buffer.
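To make the first problem concrete, here is a small host-side sketch in NumPy (no OpenCL device needed). It rebuilds an array with the same layout as self.M from the question and emulates which floats the kernel picks up for spins[i].x when it strides by 4 floats instead of 3. The names n, flat, misread_x and packed_x are illustrative only, and the emulation assumes the usual float3 layout described in the spec quote above.

```python
import numpy as np

n = 4  # hypothetical number of vectors, kept small for illustration

# Same layout as self.M in the question: packed float32 rows of (1, 2, 3).
M = np.ones((n, 3), dtype=np.float32)
M[:, 1] += 1.0
M[:, 2] += 2.0

flat = M.ravel()                # buffer contents: 1, 2, 3, 1, 2, 3, ...
host_stride = M.strides[0]      # bytes between consecutive host vectors
device_stride = 4 * M.itemsize  # float3 occupies 4 * sizeof(float)

# Emulate spins[i].x with a float3 pointer: element i starts at float 4*i.
misread_x = [float(flat[4 * i]) for i in range(n) if 4 * i < flat.size]

# Emulate vload3(i, spins).x with a float pointer: element i starts at float 3*i.
packed_x = [float(flat[3 * i]) for i in range(n)]

print(host_stride, device_stride)  # 12 16
print(misread_x)                   # [1.0, 2.0, 3.0]
print(packed_x)                    # [1.0, 1.0, 1.0, 1.0]
```

With the 16-byte stride, the x component drifts through y and z values and the last vector already falls outside the buffer, which matches the symptom of data that is "not what I supplied".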

The solution to both problems is simple: as stated in the spec, you should use vload3 to load from an array of floats:

#pragma OPENCL EXTENSION cl_amd_printf: enable
#define X xdim
#define Y ydim
__kernel void k1(__global float *spins,
                 __local float3 *tile)
{
    ushort lid = 2 * get_local_id(0);
    ushort group = 2 * get_group_id(0);
    ushort num = get_num_groups(0);
    int lim = X*Y;

    for (ushort i = 0; i < lim; i++)
        {
            if (lid == 0 && group == 0)
            {
                float3 vec = vload3(i, spins);
                printf("(%f, %f, %f) :: %d\\n", vec.x, vec.y, vec.z, i);
            }
         }
}
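If you would rather keep the __global float3 *spins signature in the kernel, a possible alternative (not the approach shown above) is to pad each vector to four floats on the host, so that the host stride matches the 16 bytes a float3 occupies. The sketch below shows only the host-side padding; pad_to_float4 is a hypothetical helper name, and the fourth component is assumed to be unused padding.

```python
import numpy as np

def pad_to_float4(vectors):
    """Return an (N, 4) float32 copy of an (N, 3) array with a zero 4th column."""
    vectors = np.asarray(vectors, dtype=np.float32)
    padded = np.zeros((vectors.shape[0], 4), dtype=np.float32)
    padded[:, :3] = vectors  # x, y, z go into the first three slots
    return padded

# Same test data as the question, for a 4 x 4 grid.
M = np.ones((16, 3), dtype=np.float32)
M[:, 1] += 1.0
M[:, 2] += 2.0

M_padded = pad_to_float4(M)
print(M_padded.strides[0])  # 16 bytes per vector, matching float3 on the device
print(M_padded[0])          # [1. 2. 3. 0.]
```

M_padded would then be passed as hostbuf in place of self.M, and the loop limit in the kernel would still be X*Y. When reading results back, the fourth column has to be dropped again, e.g. with result[:, :3].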