6
votes

I'm trying to learn how to use CUDA with thrust and I have seen some piece of code where the printf function seems to be used from the device.

Consider this code:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <cstdio>

struct functor
{
  __host__ __device__
  void operator()(int val)
  {
      printf("Call for value : %d\n", val);
  }
};

int main()
{
    thrust::host_vector<int> cpu_vec(100);
    for(int i = 0 ; i < 100 ; ++i)
      cpu_vec[i] = i;
    thrust::device_vector<int> cuda_vec = cpu_vec; //transfer to GPU
    thrust::for_each(cuda_vec.begin(),cuda_vec.end(),functor());
}

This seems to run fine, printing the message "Call for value : " followed by a number, 100 times.

Now if I include iostream and replace the printf line with the C++ stream-based equivalent

std::cout << "Call for value : " << val << std::endl;

I get compilation warnings from nvcc, and the compiled program does not print anything:

warning: address of a host variable "std::cout" cannot be directly taken in a device function
warning: calling a __host__ function from a __host__ __device__ function is not allowed
warning: calling a __host__ function("std::basic_ostream<char, std::char_traits<char> >::operator <<") from a __host__ __device__ function("functor::operator ()") is not allowed
  1. Why does it work with printf?
  2. Why does it not work with cout?
  3. What is actually run on GPU? I'd guess, at least sending to stdout requires some CPU work.
1
printf is "overloaded" as a __device__ function, while cout is not. You need explicit "overloading" of printing functions because you need to deal properly with the output buffer. Have a look at the simplePrintf example to get a feeling for why explicit overloading is needed and how it can be done. Since cout is a __host__ function only, nvcc cannot compile it. – Vitality

1 Answer

10
votes
  1. Why does it work with printf?

Because NVIDIA added runtime support for in-kernel printf on all hardware which supports the device ABI (compute capability >= 2.0). There is a template overload of the host printf in device code which provides (almost) standard C-style printf functionality. You must include cstdio or stdio.h in your device code for this mechanism to work.
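For illustration, here is a minimal standalone sketch (without thrust) of in-kernel printf. Note that cstdio must be included, and that the printed output only appears once the kernel has finished, so the host should synchronize before exiting:

```cuda
#include <cstdio>  // required for the device-side printf overload

__global__ void hello_kernel()
{
    // Each thread prints its own index within the block.
    printf("Hello from thread %d\n", threadIdx.x);
}

int main()
{
    hello_kernel<<<1, 4>>>();
    // The printf buffer is only flushed to stdout when the kernel
    // completes, so synchronize before the program exits.
    cudaDeviceSynchronize();
    return 0;
}
```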

  2. Why does it not work with cout?

Because NVIDIA has not implemented any form of C++ iostream-style I/O support within the CUDA device runtime.
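If you want a single __host__ __device__ functor that prints on both sides, a common idiom is to branch on the __CUDA_ARCH__ macro, which is defined only during device compilation. A sketch applied to your functor:

```cuda
#include <cstdio>
#include <iostream>

struct functor
{
    __host__ __device__
    void operator()(int val)
    {
#ifdef __CUDA_ARCH__
        // Device compilation path: only printf is available.
        printf("Call for value : %d\n", val);
#else
        // Host compilation path: iostream works normally.
        std::cout << "Call for value : " << val << std::endl;
#endif
    }
};
```

This compiles cleanly because the cout call is never seen by the device-side pass of the compiler.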

  3. What is actually run on GPU?

The device runtime maintains a FIFO buffer which kernel code writes to via printf calls during kernel execution. The device buffer is copied back by the CUDA driver and echoed to stdout at the end of kernel execution. The exact heuristics and mechanism are not documented, but presumably the format strings and arguments are stored in the FIFO buffer, then parsed and printed by the host-side driver via some sort of callback from the kernel launch API. The runtime API provides a function for controlling the size of the printf FIFO.
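That size-control function is cudaDeviceSetLimit with the cudaLimitPrintfFifoSize limit. As a sketch, querying and enlarging the buffer before launching a print-heavy kernel might look like this (once the FIFO fills, further output is silently dropped):

```cuda
#include <cstdio>

int main()
{
    size_t size = 0;
    // Query the current printf FIFO size.
    cudaDeviceGetLimit(&size, cudaLimitPrintfFifoSize);
    printf("Default printf FIFO size: %zu bytes\n", size);

    // Enlarge it to 16 MB; must be done before any kernel launch
    // that relies on the larger buffer.
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 16 * 1024 * 1024);
    return 0;
}
```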