I am new to Stack Overflow, SYCL and GPU programming. I have a project with a working basic SYCL kernel. The kernel logic itself works, so I am leaving it out of the question. There are also no errors during compilation or execution.
The big problem is that calling the SYCL code is very slow. At first I suspected memory copies or something similar, so I stripped the function down to the bare minimum you can see below (the comments mark where code sits in the non-minimal version).
My measured times (Release, x64):
- total time of the function with the empty kernel call, measured with the Visual Studio debugger: ~100 ms
- time of the OpenCL kernel execution, measured with NVIDIA Nsight: ~5 us
The 5 us of GPU time is as fast as expected for an empty kernel, but the ~100 ms total time of the C++ function in my code is slow.
What could be the problem here? Or is the SYCL overhead expected to be this high? (I really doubt that.)
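To make clearer what I mean by "total time of the function", the sketch below is roughly how I would split the ~100 ms into individual host-side steps (just std::chrono timing; timedLaunch and timingKernel are placeholder names, not my real code):

#include <CL/sycl.hpp>
#include <chrono>
#include <iostream>

namespace sycl = cl::sycl;

void timedLaunch(std::size_t size)
{
    using clock = std::chrono::steady_clock;

    auto t0 = clock::now();
    sycl::gpu_selector gpuSel;
    sycl::queue myQueue(gpuSel);          // device selection + queue/context creation
    auto t1 = clock::now();

    myQueue.submit([&](sycl::handler& hd) // enqueue the (empty) command group
    {
        hd.parallel_for<class timingKernel>(sycl::range<1>(size),
                                            [=](sycl::id<1>) { /* empty */ });
    });
    auto t2 = clock::now();

    myQueue.wait();                       // host blocks until the kernel has finished
    auto t3 = clock::now();

    auto ms = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    std::cout << "queue setup: " << ms(t0, t1) << " ms, "
              << "submit: "      << ms(t1, t2) << " ms, "
              << "wait: "        << ms(t2, t3) << " ms\n";
}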
My efforts:
- I changed my compute++.exe flags from -O2 to -O3, which improved the total time by about 5 to 10 ms.
- I reduced the kernel to the bare minimum.
The code inside a DLL function:
{ // scope
    sycl::gpu_selector gpuSel;
    sycl::queue myQueue(gpuSel);
    //....buffers
    auto ra = range<1>(size);
    myQueue.submit([&](sycl::handler& hd)
    {
        //....get_access<access::mode::read>
        auto kernel = ([=](cl::sycl::id<1> id)
        {
            //...some vector math
        });
        hd.parallel_for<someName>(ra, kernel);
    });
    myQueue.wait();
}
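For completeness: the ~5 us device-side kernel time could in principle also be checked from inside the code with SYCL event profiling instead of Nsight. This is only a sketch (it assumes the queue accepts the enable_profiling property; profiledLaunch and profiledKernel are placeholder names), my numbers above do not come from it:

#include <CL/sycl.hpp>
#include <iostream>

namespace sycl = cl::sycl;

void profiledLaunch(std::size_t size)
{
    sycl::gpu_selector gpuSel;
    sycl::property_list props{sycl::property::queue::enable_profiling()};
    sycl::queue myQueue(gpuSel, props);

    // submit() returns an event that can be queried for device-side timestamps
    sycl::event ev = myQueue.submit([&](sycl::handler& hd)
    {
        hd.parallel_for<class profiledKernel>(sycl::range<1>(size),
                                              [=](sycl::id<1>) { /* empty */ });
    });
    ev.wait();

    auto start = ev.get_profiling_info<sycl::info::event_profiling::command_start>();
    auto end   = ev.get_profiling_info<sycl::info::event_profiling::command_end>();
    std::cout << "kernel time: " << (end - start) << " ns\n"; // timestamps are in nanoseconds
}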
I am using:
- Visual Studio 2019
- ComputeCpp Community 2.0.0
- latest CUDA drivers
- NVIDIA GTX 980, ptx64 target (experimental ComputeCpp support)
compute++ call:
"..\compute++.exe" -sycl -D_ALLOW_COMPILER_AND_STL_VERSION_MISMATCH -O3 -mllvm -inline-threshold=1000 -intelspirmetadata -sycl-target ptx64 -std=c++14 -I"../Codeplay/ComputeCpp/include" -I"../NVIDIA GPU Computing Toolkit/CUDA/v10.2/include" -sycl-ih something.cpp.sycl -c something.cpp
Summarized:
The total execution time of a SYCL kernel call is slow.
Can I do something to improve it, or is this caused by the SYCL/ComputeCpp implementation on NVIDIA GPUs and simply expected to be this slow?