I've written a dot-product kernel for OpenCL, driven from C++, and it works for a vector length of 4096 (I also tried 12k elements and it works flawlessly). When I increase the vector length to 16k elements, the result becomes infinity, although it should never exceed a small float value. There seems to be a leak or something similar, yet everything is fine for n < 16k elements. 16k elements at 4 bytes each make 64 kB, and the three buffers sum to 192 kB, which is not even 1/1000th of the GPU's memory. I compared the result with the same reduction algorithm in host code (C#), and the host result is small, as expected. Precision errors cannot build up to infinity either (at worst the sum would get capped at some finite value).
Here is the kernel (Ln = local work size, n = global work size), built as a string in C# and passed to C++ through a DLL call:
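For reference, the host-side comparison and the memory arithmetic can be sketched in C++ as follows. This is only a minimal sketch, not the C# code actually used: the names `dotTwoStage` and `bufferBytes` are made up for illustration, and `groupSize` stands in for Ln.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Two-stage dot product on the host: one partial sum per block of
// `groupSize` elements (the role Ln plays in the kernel), followed by
// a sum over those partial sums.
float dotTwoStage(const std::vector<float>& a,
                  const std::vector<float>& b,
                  std::size_t groupSize)
{
    assert(a.size() == b.size() && groupSize > 0);
    const std::size_t n = a.size();
    std::vector<float> partial;
    for (std::size_t start = 0; start < n; start += groupSize) {
        const std::size_t end = std::min(start + groupSize, n);
        float sum = 0.0f;
        for (std::size_t k = start; k < end; ++k) {
            sum += a[k] * b[k];
        }
        partial.push_back(sum); // one temporary sum per block
    }
    float total = 0.0f;
    for (float p : partial) {
        total += p; // sum of temporary sums
    }
    return total;
}

// Total bytes occupied by `buffers` float buffers of `n` elements each.
std::size_t bufferBytes(std::size_t n, std::size_t buffers)
{
    return n * sizeof(float) * buffers;
}
```

With 16384 elements of 1.0f and 0.5f and a block size of 256, `dotTwoStage` gives 8192, and `bufferBytes(16384, 3)` gives 196608 bytes (192 kB) on a platform with 4-byte floats.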
"__kernel void SkalarCarpim(__global float * v1, __global float * v2, __global float * v3)" +
"{" +
" int i = get_global_id(0);" +
" int j = get_local_id(0);" +
" __local float biriktirici [" + Ln.ToString() + "];" +
" barrier(CLK_LOCAL_MEM_FENCE);" +
" biriktirici[j]=v1[i]*v2[i];" +
" barrier(CLK_LOCAL_MEM_FENCE);" +
" barrier(CLK_GLOBAL_MEM_FENCE);" +
" float toplam=0.0f;" +
" if(j==0)" +
" {" +
" for(int k=0;k<"+Ln.ToString()+";k++)"+ // reduction
" {"+
" toplam+=biriktirici[k];"+
" }"+
" }" +
" barrier(CLK_GLOBAL_MEM_FENCE);" +
" v3[i]=toplam;" +
" barrier(CLK_GLOBAL_MEM_FENCE);" +
" toplam=0.0f;" +
" for(int k=0;k<"+(n/Ln).ToString()+";k++)" +
" {" +
" toplam+=v3[k*"+Ln.ToString()+"]; " + // sum of temporary sums
" }" +
" v3[i]=toplam;"+
"}";
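One thing worth ruling out here, offered as an assumption rather than a confirmed diagnosis: in OpenCL, `barrier()` synchronizes only the work-items of a single work-group, so the final loop may read `v3[k*Ln]` slots that other work-groups have not written yet. The CPU-side sketch below emulates the kernel's data flow under a deliberately simplified scheduling model: work-groups run in lockstep batches of `wave` groups. Both `simulateKernel` and `wave` are hypothetical names, and NaN is used as a stand-in for whatever uninitialized device memory happens to contain.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Emulates the kernel's reads and writes of v3 on the CPU.
// Assumption: at most `wave` work-groups are resident at once and they
// advance in lockstep; real devices give no such guarantee.
std::vector<float> simulateKernel(const std::vector<float>& v1,
                                  const std::vector<float>& v2,
                                  std::size_t Ln, std::size_t wave)
{
    const std::size_t n = v1.size();
    assert(v2.size() == n && Ln > 0 && wave > 0 && n % Ln == 0);
    const std::size_t groups = n / Ln;
    // NaN marks slots that no work-group has written yet.
    std::vector<float> v3(n, std::numeric_limits<float>::quiet_NaN());

    for (std::size_t first = 0; first < groups; first += wave) {
        const std::size_t last = std::min(first + wave, groups);
        // Stage 1: work-item 0 of each resident group stores the group's
        // partial sum; every other work-item stores 0.
        for (std::size_t g = first; g < last; ++g) {
            float sum = 0.0f;
            for (std::size_t j = 0; j < Ln; ++j) {
                sum += v1[g * Ln + j] * v2[g * Ln + j];
            }
            for (std::size_t j = 0; j < Ln; ++j) {
                v3[g * Ln + j] = (j == 0) ? sum : 0.0f;
            }
        }
        // Stage 2: the resident groups read v3[k*Ln] for ALL groups,
        // including groups that have not run stage 1 yet.
        float total = 0.0f;
        for (std::size_t k = 0; k < groups; ++k) {
            total += v3[k * Ln];
        }
        // Every resident work-item then overwrites its slot with the total.
        for (std::size_t i = first * Ln; i < last * Ln; ++i) {
            v3[i] = total;
        }
    }
    return v3;
}
```

In this model the result is correct whenever all groups fit into one batch and is polluted by unwritten slots as soon as they do not, which would match a kernel that works at 12k elements and breaks at 16k. Whether this is what actually happens on the HD7870 is not something the sketch can prove.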
Here are the C++ OpenCL buffers:
buf1=cl::Buffer(altYapi,CL_MEM_READ_WRITE,sizeof(cl_float) * N);
buf2=cl::Buffer(altYapi,CL_MEM_READ_WRITE,sizeof(cl_float) * N);
buf3=cl::Buffer(altYapi,CL_MEM_READ_WRITE,sizeof(cl_float) * N);
//CL_MEM_READ_ONLY gives the same error; I tried some other flags too, no solution :(
Here is how the buffers are sent:
komutSirasi.enqueueWriteBuffer(buf1,CL_TRUE,0,sizeof(cl_float)*N,v1);
komutSirasi.enqueueWriteBuffer(buf2,CL_TRUE,0,sizeof(cl_float)*N,v2);
//CL_TRUE makes the call blocking, so it waits until the write has finished
Execution:
komutSirasi.enqueueNDRangeKernel(kernel,0,Global,Local);
//I got this from an example and I don't know whether it is blocking or not.
Here is how the result buffer is read back (every element holds the result; I know it's unfinished):
komutSirasi.enqueueReadBuffer(buf3,CL_TRUE,0,sizeof(cl_float) * N,v3);
//CL_TRUE makes the call blocking, so it waits until the read has finished
Question: Is there some configuration I must do before diving into C++ OpenCL? This was not an issue in Java/Aparapi/JOCL.
I'm using the OpenCL 1.2 headers from Khronos' site and AMD's OpenCL.lib + OpenCL.dll for this, if it helps (the target device is an HD7870).