
I've written a dot-product kernel for OpenCL, called from C++. It works for a vector length of 4096 (I also tried 12k elements and it works flawlessly), but when I increase the vector length to 16k elements, the result becomes infinity, while it should never exceed a small float value. Clearly there is a leak or something similar, but it works fine for n < 16k elements. 16k elements at 4 bytes each is 64 kB, and three buffers add up to 192 kB, which is not even 1/1000th of the GPU's memory. I compared the result against the same reduction algorithm in host code (C#), and the host result is small, as expected. Accumulated precision error can't explain the infinity either (at worst the result would get capped at some finite value).

Here is the kernel (Ln = local work size, n = global work size), built as a string in C# and passed to C++ through a DLL call:

"__kernel void SkalarCarpim(__global float * v1, __global float * v2, __global float * v3)" +
            "{" +
            "    int i = get_global_id(0);" +
            "    int j = get_local_id(0);" +
            "    __local float biriktirici [" + Ln.ToString() + "];" +
            "    barrier(CLK_LOCAL_MEM_FENCE);" +
            "    biriktirici[j]=v1[i]*v2[i];" +
            "    barrier(CLK_LOCAL_MEM_FENCE);" +
            "    barrier(CLK_GLOBAL_MEM_FENCE);" +
            "    float toplam=0.0f;" +
            "    if(j==0)" +
            "    {" +
            "        for(int k=0;k<"+Ln.ToString()+";k++)"+ // reduction
            "        {"+
            "             toplam+=biriktirici[k];"+
            "        }"+
            "    }" +
            "    barrier(CLK_GLOBAL_MEM_FENCE);" +
            "    v3[i]=toplam;" +
            "    barrier(CLK_GLOBAL_MEM_FENCE);" +
            "    toplam=0.0f;" +
            "    for(int k=0;k<"+(n/Ln).ToString()+";k++)" + 
            "    {" +
            "         toplam+=v3[k*"+Ln.ToString()+"];       " + // sum of temporary sums
            "    }" +
            "    v3[i]=toplam;"+
            "}";

Here are the C++ OpenCL buffers:

buf1=cl::Buffer(altYapi,CL_MEM_READ_WRITE,sizeof(cl_float) * N);
buf2=cl::Buffer(altYapi,CL_MEM_READ_WRITE,sizeof(cl_float) * N);
buf3=cl::Buffer(altYapi,CL_MEM_READ_WRITE,sizeof(cl_float) * N);
//CL_MEM_READ_ONLY makes same error, tried some other too, no solution :(

Here is how the buffers are sent:

komutSirasi.enqueueWriteBuffer(buf1,CL_TRUE,0,sizeof(cl_float)*N,v1);
komutSirasi.enqueueWriteBuffer(buf2,CL_TRUE,0,sizeof(cl_float)*N,v2);
//CL_TRUE makes a blocking action so waits until finished

Execution:

 komutSirasi.enqueueNDRangeKernel(kernel,0,Global,Local);
 //I got this from an example and I don't know whether it is blocking or not.

Here is how the result buffer is read back (every element should hold the result; I know it's unfinished):

komutSirasi.enqueueReadBuffer(buf3,CL_TRUE,0,sizeof(cl_float) * N,v3);
//CL_TRUE makes a blocking action so waits until finished

Question: Is there some configuration I must do before diving into C++ OpenCL? This was not an issue in Java/Aparapi/JOCL.

I'm using the OpenCL 1.2 headers from Khronos' site and AMD's OpenCL.lib + OpenCL.dll, if that helps (the target device is an HD7870).


1 Answer

3
votes

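Your second reduction, the sum of v3[k*Ln], assumes that every partial sum in v3 has already been computed. That would require synchronization between different work-groups, which is not possible in the general case: barrier() only synchronizes the work-items inside one work-group, so a work-group can read slots of v3 that another work-group has not written yet. It may work by accident, for example when there is only a single work-group.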

After the first reduction, you should store toplam in v3[get_group_id(0)], and then run a second kernel for the second reduction.
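
To make the two-pass structure concrete, here is a minimal CPU-side sketch in C++ that mirrors it. It is not OpenCL code and not the asker's implementation: the function names partial_dot and sum_partials are hypothetical, the outer loop stands in for the work-groups of the first kernel, and the second function stands in for the second kernel launch (or a small host-side loop), which can only start after the first pass has finished.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pass 1 (stands in for the first kernel): every work-group reduces only its
// own slice and writes exactly ONE partial sum, indexed by its group id.
// Hypothetical helper; local_size plays the role of Ln.
std::vector<float> partial_dot(const std::vector<float>& v1,
                               const std::vector<float>& v2,
                               std::size_t local_size) {
    assert(local_size > 0);
    assert(v1.size() == v2.size());
    assert(v1.size() % local_size == 0);

    const std::size_t groups = v1.size() / local_size;
    std::vector<float> partials(groups, 0.0f);

    for (std::size_t g = 0; g < groups; ++g) {          // one iteration per work-group
        float toplam = 0.0f;
        for (std::size_t j = 0; j < local_size; ++j) {  // work-items of this group
            const std::size_t i = g * local_size + j;   // corresponds to get_global_id(0)
            toplam += v1[i] * v2[i];
        }
        partials[g] = toplam;                           // corresponds to v3[get_group_id(0)]
    }
    return partials;
}

// Pass 2 (stands in for the second kernel): runs only after pass 1 is
// complete, so every partial sum is guaranteed to be valid when it is read.
float sum_partials(const std::vector<float>& partials) {
    float total = 0.0f;
    for (float p : partials) {
        total += p;
    }
    return total;
}
```

With 16384 elements and a local size of 256, the first pass produces 64 partial sums and the second pass adds them up. On the device the same split applies: the only point at which all work-groups are known to have finished is the boundary between the two kernel launches.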