2
votes

I am new to Stack Overflow, SYCL and GPU programming. I have a project with a working basic SYCL kernel. The logic works, so I'm omitting it from the question. There are also no errors during compilation or execution.

The big problem is that the call into the SYCL code is very slow. At first I thought it was memory copying or similar, so I stripped out everything except what you see below (the bare minimum; comments mark where code would be located in the full kernel).

My measured times: (Release x64)

  • with the Visual Studio debugger attached, total time of the function with an empty kernel call: ~100 ms
  • with NVIDIA Nsight, time of the OpenCL kernel execution: ~5 us

The kernel GPU time of 5 us is as fast as expected for an empty kernel. But the total time of the C++ function in my code, at ~100 ms, is slow.

What could be the problem here? Or is SYCL overhead expected to be this high? (I really doubt that.)

My efforts:

  • I changed my compute++.exe flags from -O2 to -O3, which improved the total time by about 5 to 10 ms.
  • I reduced the kernel to the bare minimum.

The code inside a dll function:

 { //scope

    sycl::gpu_selector gpuSel;
    sycl::queue myQueue(gpuSel);

    //....buffers

    auto ra = sycl::range<1>(size);

    myQueue.submit([&](sycl::handler& hd)
    {
            //....get_access<access::mode::read>

            auto kernel = ([=](sycl::id<1> id)
            {
                    //...some vector math
            });

            hd.parallel_for<someName>(ra, kernel);             
    });

    myQueue.wait();
}
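One host-side cost worth isolating: the snippet above constructs the selector and queue on every call into the DLL. A hedged sketch of keeping one queue alive across calls (the `static` local and the `runKernel` wrapper are assumptions about how the DLL function is structured, not the author's actual code):

```cpp
#include <CL/sycl.hpp>

void runKernel(size_t size) {
    // Construct the queue once and reuse it on subsequent calls;
    // queue creation and the first kernel JIT can cost far more
    // than the kernel execution itself.
    static cl::sycl::queue myQueue{cl::sycl::gpu_selector{}};

    //....buffers

    auto ra = cl::sycl::range<1>(size);

    myQueue.submit([&](cl::sycl::handler& hd) {
        //....get_access<access::mode::read>
        hd.parallel_for<class someName>(ra, [=](cl::sycl::id<1> id) {
            //...some vector math
        });
    });

    myQueue.wait();
}
```

If the 100 ms persists on the second and later calls with the queue reused, the remaining cost is per-submission overhead rather than setup.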

I am using:

  • Visual Studio 2019
  • ComputeCpp Community 2.0.0
  • Latest CUDA drivers
  • NVIDIA GTX 980, ptx64 (experimental ComputeCpp support)

compute++ call:

"..\compute++.exe" -sycl -D_ALLOW_COMPILER_AND_STL_VERSION_MISMATCH -O3 -mllvm -inline-threshold=1000 -intelspirmetadata -sycl-target ptx64 -std=c++14 -I"../Codeplay/ComputeCpp/include" -I"../NVIDIA GPU Computing Toolkit/CUDA/v10.2/include"  -sycl-ih something.cpp.sycl -c something.cpp

Summarized:
The total execution time of a SYCL kernel is slow. Can I do something to improve it, or is this caused by the SYCL/ComputeCpp implementation on NVIDIA GPUs and expected to be this slow?


2 Answers

2
votes

First I would point out that this is a very simple piece of SYCL code, so if you are looking to measure performance it's probably not a very relevant example. Here's a research paper showing comparable performance of ComputeCpp with CUDA on a reduction algorithm benchmark; see slide 40 for the chart. You'll also see in the presentation that the performance gain grows sharply with the size of the data set being worked on. That is generally true in HPC programming: the benefits of a GPU are usually only seen when processing larger data sets.

The difference you are seeing is because ComputeCpp uses OpenCL callbacks, and the NVIDIA OpenCL driver does seem to introduce an overhead when using these callbacks. Here's a relevant post about this from a while back.

If you were to write a simple OpenCL kernel that uses callbacks, it would exhibit the same sort of behaviour.
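For reference, the OpenCL mechanism in question is `clSetEventCallback`. A hedged fragment showing the pattern (it assumes a `queue` and `kernel` have already been created, so it is not a complete program):

```cpp
#include <CL/cl.h>
#include <cstdio>

// Invoked by the driver when the kernel's event reaches CL_COMPLETE;
// the latency around this notification is the overhead being discussed.
void CL_CALLBACK onComplete(cl_event ev, cl_int status, void* user_data) {
    std::printf("kernel finished with status %d\n", status);
}

// Enqueue a 1-D kernel and attach a completion callback to its event.
void launchWithCallback(cl_command_queue queue, cl_kernel kernel, size_t size) {
    cl_event ev;
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &size, nullptr,
                           0, nullptr, &ev);
    clSetEventCallback(ev, CL_COMPLETE, onComplete, nullptr);
    clFinish(queue);   // block until the queue drains; the callback
                       // fires asynchronously around completion
    clReleaseEvent(ev);
}
```

Timing this against the same launch without the callback would show whether the driver's callback path accounts for the gap.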

I'd also add that we've implemented NVIDIA support for the DPC++ compiler, which uses CUDA directly and does not show the same level of overhead. You can find out more about that in our blog post; it would be worth giving that a try if you want to run SYCL code on NVIDIA hardware.

0
votes

GPUs are terrible when you want to add or multiply 3 or 4 numbers. For that you are better off using the CPU: it is optimized for such small workloads, and you may have an AVX extension that is optimized for vector math. So you should replace cl::sycl::gpu_selector with cl::sycl::cpu_selector. I'm not sure whether SYCL uses AVX when you have it, but it will definitely use multithreading.

But when you're trying to add 500'000 numbers, the GPU will be much faster than the CPU.
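The suggested change is a one-line swap of the device selector; a sketch in the SYCL 1.2.1 / ComputeCpp style used in the question (only the selector differs from the original code):

```cpp
#include <CL/sycl.hpp>

int main() {
    // Select the CPU device instead of the GPU; for tiny workloads this
    // avoids the GPU dispatch overhead entirely.
    cl::sycl::cpu_selector cpuSel;   // was: cl::sycl::gpu_selector
    cl::sycl::queue myQueue(cpuSel);

    // ... same buffers, submit(...) and kernel as in the question ...

    myQueue.wait();
}
```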

This video explains it very well.