7 votes

I'm new to GPU programming and am unsure what would lead to the most efficient code. What are the pros/cons of using Thrust vs writing a custom kernel and managing memory yourself?

If it helps to elaborate on my goal: I have a large matrix, and for each value I need to perform a few vector operations. I know I need dynamic parallelism for this task, and I currently have a custom kernel that walks the matrix and launches other kernels. I'm considering whether that kernel should be replaced with a Thrust call (e.g. thrust::for_each), and/or whether I should use Thrust inside the kernel for the vector operations.
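Roughly, the structure I mean looks like the sketch below (the names and the per-element operation are made up placeholders; the real code is more involved):

```cuda
// Hypothetical sketch of the current setup (names and the actual vector
// operation are placeholders). Compiled with -rdc=true for dynamic
// parallelism, compute capability 3.5+.

// Child kernel: the per-element vector operation (placeholder).
__global__ void vectorOpKernel(const float* vec, float* out, int len, float matrixValue)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len)
        out[i] = matrixValue * vec[i];
}

// Parent kernel: walks the matrix and launches a child grid per element.
__global__ void matrixKernel(const float* matrix, const float* vec,
                             float* out, int rows, int cols, int vecLen)
{
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows && c < cols)
    {
        float v = matrix[r * cols + c];
        float* slice = out + (size_t)(r * cols + c) * vecLen;  // one output slice per element
        vectorOpKernel<<<(vecLen + 255) / 256, 256>>>(vec, slice, vecLen, v);
    }
}
```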


1 Answer

9 votes

Over the last ~12 months I've gone from writing predominantly CUDA kernels to predominantly using Thrust, and then back to writing predominantly CUDA kernels. In general, writing your own CUDA kernels should provide better raw performance, but in simpler test cases the difference should be negligible.

Thrust mimics the C++ STL, so it carries many of the same upsides and downsides as the STL. Namely, it's designed to operate on vectors of data in a very generalized way. From that perspective, Thrust is better at some things than raw CUDA, but it shouldn't be seen as a one-size-fits-all solution.

Thrust's main advantages are abstraction and portability: you don't have to think about block sizes, and it's easy to write functors that apply equally well to data on the device or on the host, whereas a CUDA kernel can only operate on device memory. It also ships a number of very useful algorithms; it's nice not having to write your own reduction or sort, as Thrust provides very efficient implementations of both.

The downside is that, under the hood, your data access patterns might not match what Thrust was designed for, and Thrust tends to perform a lot of temporary memory allocations, which is often bad in a performance-critical context. You can hack its memory management model to cache those temporary allocations, but I don't recommend it; just write kernels instead and take full control of your memory usage yourself.
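To make the portability point concrete, here's a minimal sketch (the functor name is made up; the Thrust calls are the real API): the same __host__ __device__ functor drives a transform on a host_vector and on a device_vector, and a built-in reduction replaces a hand-written kernel.

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

// One functor, usable on both the host and the device.
struct saxpy_functor
{
    float a;
    explicit saxpy_functor(float a_) : a(a_) {}

    __host__ __device__
    float operator()(float x, float y) const { return a * x + y; }
};

int main()
{
    thrust::host_vector<float>   hx(1 << 20, 1.0f), hy(1 << 20, 2.0f);
    thrust::device_vector<float> dx = hx, dy = hy;   // host -> device copies

    // Same functor runs on the host...
    thrust::transform(hx.begin(), hx.end(), hy.begin(), hy.begin(), saxpy_functor(2.0f));
    // ...and on the device.
    thrust::transform(dx.begin(), dx.end(), dy.begin(), dy.begin(), saxpy_functor(2.0f));

    // Built-in reduction; no hand-written kernel needed.
    float sum = thrust::reduce(dy.begin(), dy.end(), 0.0f, thrust::plus<float>());
    return sum > 0.0f ? 0 : 1;
}
```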

My preferred work mode right now is to use CUDA for almost everything, dipping into Thrust only for specific algorithms (e.g. sort), for prototype code, or for code where I'd like the implementation to work equally well on the host and the device.
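As a rough illustration of that mixed approach, assuming you already manage device memory with cudaMalloc yourself: you can wrap the raw pointer with thrust::device_pointer_cast, borrow Thrust's sort, and keep using the same buffer from your own kernels (the kernel and sizes below are made up).

```cuda
#include <thrust/device_ptr.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>

// Placeholder kernel standing in for your own CUDA code.
__global__ void myKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    // ... fill d_data with your own kernels or cudaMemcpy ...

    // Borrow Thrust just for the sort.
    thrust::device_ptr<float> p = thrust::device_pointer_cast(d_data);
    thrust::sort(p, p + n);

    // Back to plain CUDA for everything else.
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    cudaFree(d_data);
    return 0;
}
```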