
The code I'm trying to parallelize with OpenMP is a Monte Carlo simulation that boils down to something like this:

int seed = 0;
std::mt19937 rng(seed); 
double result = 0.0;
int N = 1000;

#pragma omp parallel for
for (int i = 0; i < N; i++)
{
    result += rng();
}
std::cout << result << std::endl;

I want to make sure that the state of the random number generator is shared across threads, and that the addition to the result is atomic.

Is there a way of replacing this code with something from thrust::omp? From the research I have done so far, it looks like thrust::omp is more of a directive to use multiple CPU threads rather than the GPU for some standard thrust operations.

Do you want to run the code on the GPU? Or just on the CPU? The thrust OMP device backend does not use the GPU. Instead of "multiple CPU threads rather than CPU" did you intend to say "multiple CPU threads rather than GPU"? – Robert Crovella
I edited the question. Your observation is correct: my calculations are too complicated to run on the GPU; instead I want to run on multiple CPU threads. – Boraxis

1 Answer


Yes, it's possible to use thrust to do something similar, with (parallel) execution on the host CPU using OMP threads underneath the thrust OMP backend. Here's one example:

$ cat t535.cpp
#include <random>
#include <iostream>
#include <thrust/system/omp/execution_policy.h>
#include <thrust/system/omp/vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>

int main(int argc, char *argv[]){
  unsigned N = 1;
  int seed = 0;
  if (argc > 1)  N = atoi(argv[1]);
  if (argc > 2)  seed = atoi(argv[2]);
  std::mt19937 rng(seed);
  unsigned long result = 0;

  thrust::omp::vector<unsigned long> vec(N);
  thrust::generate(thrust::omp::par, vec.begin(), vec.end(), rng);
  result = thrust::reduce(thrust::omp::par, vec.begin(), vec.end());
  std::cout << result << std::endl;
  return 0;
}
$ g++ -std=c++11 -O2 -I/usr/local/cuda/include -o t535 t535.cpp -fopenmp -lgomp
$ time ./t535 100000000
214746750809749347

real    0m0.700s
user    0m2.108s
sys     0m0.600s
$

For this test I used Fedora 20, with CUDA 6.5RC, running on a 4-core Xeon CPU (netting about a 3x speedup, based on the time results). There are probably some further "optimizations" that could be made for this particular code, but I think they would unnecessarily clutter the idea, and I assume your actual application is more complicated than just summing random numbers.

Much of what I show here was lifted from the thrust direct system access page, but there are several comparable methods of accessing the OMP backend, depending on whether you want flexible, retargetable code or code that explicitly targets the OMP backend (this example explicitly targets the OMP backend); a sketch of the retargetable alternative follows.
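As a rough sketch of the retargetable alternative (my own illustration, not code from the thrust page): you can write against thrust::device_vector and the default execution policy, and pick the backend at compile time with the THRUST_DEVICE_SYSTEM macro. For this particular code only host backends such as OMP make sense, because std::mt19937 is a host-side generator. The build line assumes the same toolchain as above.

// build (assuming the same toolchain as the example above):
//   g++ -std=c++11 -O2 -I/usr/local/cuda/include \
//       -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP \
//       -o retarget retarget.cpp -fopenmp -lgomp
#include <random>
#include <iostream>
#include <cstdlib>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>

int main(int argc, char *argv[]){
  unsigned N = 1;
  int seed = 0;
  if (argc > 1)  N = atoi(argv[1]);
  if (argc > 2)  seed = atoi(argv[2]);
  std::mt19937 rng(seed);

  // "device" here is whatever backend THRUST_DEVICE_SYSTEM selects;
  // with the define above it is the OMP backend, i.e. host memory and CPU threads
  thrust::device_vector<unsigned long> vec(N);
  thrust::generate(vec.begin(), vec.end(), rng);
  unsigned long result = thrust::reduce(vec.begin(), vec.end());
  std::cout << result << std::endl;
  return 0;
}

The same source could then be rebuilt against another backend by changing only the -DTHRUST_DEVICE_SYSTEM definition, rather than editing the code.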

The thrust::reduce operation guarantees the "atomicity" you are looking for; specifically, it guarantees that two threads are never updating a single location at the same time. However, the use of std::mt19937 in a multithreaded OMP application is outside the scope of my answer, I think. If I create an ordinary OMP app using the code you provided, I observe run-to-run variability in the results, due (I think) to unsynchronized use of the single std::mt19937 rng from multiple OMP threads. This is not something thrust can sort out for you.
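As an aside that goes beyond the original answer: if you stay with plain OpenMP, the usual workaround is to give each thread its own engine and do one atomic update per thread rather than per iteration. A minimal sketch, with a deliberately simplistic per-thread seeding scheme chosen just for illustration:

#include <random>
#include <iostream>
#include <omp.h>

int main(){
  const int N = 1000;
  const int seed = 0;
  double result = 0.0;

  #pragma omp parallel
  {
    // each thread gets a private engine; seeding by thread id is a
    // simplistic illustrative choice, not a statistically rigorous one
    std::mt19937 rng(seed + omp_get_thread_num());
    double local = 0.0;

    #pragma omp for
    for (int i = 0; i < N; i++)
      local += rng();

    // one atomic update per thread instead of one per iteration
    #pragma omp atomic
    result += local;
  }
  std::cout << result << std::endl;
  return 0;
}

Note that the numerical result will differ from a serial run, because each thread draws from its own stream.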

Thrust also has its own random number generators (in thrust/random.h), which are designed to work with it.
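For completeness, here is a sketch (again my own illustration, not from the original answer) of a Monte Carlo style sum that avoids shared RNG state entirely: each index gets its own thrust::default_random_engine, skipped ahead with discard(), and the sum is done with transform_reduce on the OMP backend. The seed and the 0 to 1 range are arbitrary choices for the example.

#include <iostream>
#include <thrust/random.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/system/omp/execution_policy.h>

struct rand_val
{
  unsigned seed;
  rand_val(unsigned s) : seed(s) {}
  double operator()(unsigned long n) const
  {
    // a fresh engine per index, discarded to position n, so no state is shared
    thrust::default_random_engine rng(seed);
    rng.discard(n);
    thrust::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(rng);
  }
};

int main(){
  const unsigned long N = 10000000;
  double sum = thrust::transform_reduce(thrust::omp::par,
                                        thrust::counting_iterator<unsigned long>(0),
                                        thrust::counting_iterator<unsigned long>(N),
                                        rand_val(0), 0.0, thrust::plus<double>());
  std::cout << sum / N << std::endl;   // should land near 0.5
  return 0;
}

It should build with the same g++ command line as the example above.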