4
votes

I've read this description of the OpenCL 2.x pipe API and leafed through the Pipe API pages at khronos.org. Working almost exclusively in CUDA, I felt a bit jealous of this nifty feature available only in OpenCL (and sorry that CUDA functionality has not been properly subsumed by OpenCL, but that's a different issue), so I thought I'd ask "How come CUDA doesn't have a pipe mechanism?" But then I realized I don't even know what that would mean exactly. So, instead, I'll ask:

  1. How do OpenCL pipes work on AMD discrete GPUs / APUs? ...

    • What info gets written where?
    • How is the scheduling of kernel workgroups to cores affected by the use of pipes?
    • Do piped kernels get compiled together (say, their SPIR forms)?
    • Does the use of pipes allow passing data between different kernels via the core-specific cache ("local memory" in OpenCL parlance, "shared memory" in CUDA parlance)? That would be awesome.
  2. Is there a way pipes are "supposed" to work on a GPU, generally? i.e. something the API authors envisioned or even put in writing?
  3. How do OpenCL pipes work in CPU-based OpenCL implementations?

1 Answer

5
votes

OpenCL pipes were introduced with OpenCL 2.0. On GPUs, a pipe is essentially a global-memory buffer with controlled access, i.e. you can limit the number of workgroups that are allowed to read from or write to the pipe simultaneously. This lets you reuse the same buffer (the pipe) without worrying about conflicting reads or writes from multiple workgroups. As far as I know, OpenCL pipes do not use GPU local memory, but if you tune the size of the pipe carefully you can increase cache hits and thereby achieve better overall performance.

There is no general rule for when pipes should be used. I use pipes to pass data between two concurrently running kernels, which improves my program's overall performance through a better cache-hit ratio.

Pipes work the same way on CPU-based OpenCL implementations: the pipe is just a global buffer, which may fit in the system cache if it is small enough. On devices like FPGAs, however, pipes work differently. There, pipes are implemented in local memory rather than global memory, and so achieve considerably higher performance than a global-memory buffer.