21
votes

I would like to iterate through all elements in an std::list in parallel fashion using OpenMP. The loop should be able to alter the elements of the list. Is there a simple solution for this? It seems that OpenMP 3.0 supports parallel for loops when the iterator is a Random Access Iterator, but not otherwise. In any case, I would prefer to use OpenMP 2.0 as I don't have full control over which compilers are available to me.

If my container were a vector, I might use:

#pragma omp parallel for
for (auto it = v.begin(); it < v.end(); ++it) {
    it->process();
}

I understand that I could copy the list into a vector, do the loop, then copy everything back. However, I would like to avoid this complexity and overhead if possible.
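For reference, that copy-in/copy-out workaround would look roughly like this (a sketch; T and lst are placeholder names):

std::vector<T> tmp(lst.begin(), lst.end()); // copy the list into a vector

#pragma omp parallel for
for (int i = 0; i < (int)tmp.size(); ++i)
    tmp[i].process();

lst.assign(tmp.begin(), tmp.end());         // copy the processed elements back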


5 Answers

32
votes

If you decide to use OpenMP 3.0, you can use the task feature:

#pragma omp parallel
#pragma omp single
{
  for(auto it = l.begin(); it != l.end(); ++it)
     #pragma omp task firstprivate(it)
       it->process();
  #pragma omp taskwait
}

This will execute the loop in one thread, but delegate the processing of the elements to the others: each iteration spawns a task, which any thread of the team can pick up.

Without OpenMP 3.0, the easiest way is to write pointers to all elements of the list (or iterators) into a vector and iterate over that one. This way nothing has to be copied back, and you avoid the overhead of copying the elements themselves, so it shouldn't have too much overhead:

std::vector<my_element*> elements; // my_element is whatever is in the list
for(auto it = list.begin(); it != list.end(); ++it)
  elements.push_back(&(*it));

#pragma omp parallel for
for(int i = 0; i < (int)elements.size(); ++i) // int index, since OpenMP 2.0 wants a signed loop variable
    elements[i]->process();

If you want to avoid copying even the pointers, you can always write a parallelized for loop by hand. You can either have the threads access interleaved elements of the list (as proposed by KennyTM; a sketch of that variant follows after the next paragraph) or split the range into roughly equal contiguous parts before iterating, and iterate over those. The latter seems preferable, since the threads avoid accessing list nodes currently processed by other threads (even if only the next pointer), which could lead to false sharing. This would look roughly like this:

#pragma omp parallel
{
  int thread_count = omp_get_num_threads();
  int thread_num   = omp_get_thread_num();
  size_t chunk_size= list.size() / thread_count;
  auto begin = list.begin();
  std::advance(begin, thread_num * chunk_size);
  auto end = begin;
  if(thread_num == thread_count - 1) // last thread iterates the remaining sequence
     end = list.end();
  else
     std::advance(end, chunk_size);
  #pragma omp barrier
  for(auto it = begin; it != end; ++it)
    it->process();
}

The barrier is not strictly needed; however, if process mutates the processed element (meaning it is not a const method), there might be some sort of false sharing without it, as some threads would iterate over a sequence that is already being mutated. This way will iterate 3*n times over the sequence (where n is the number of threads), so scaling might be less than optimal for a high number of threads.
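For comparison, the interleaved variant mentioned above might look roughly like this (a sketch; every thread walks the whole list, but only processes every thread_count-th element):

#pragma omp parallel
{
  int thread_count = omp_get_num_threads();
  int thread_num   = omp_get_thread_num();
  int i = 0;
  for(auto it = list.begin(); it != list.end(); ++it, ++i)
    if(i % thread_count == thread_num) // thread k takes elements k, k+n, k+2n, ...
      it->process();
}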

To reduce the overhead of the chunked version, you could generate the ranges outside of the #pragma omp parallel; however, you then need to know how many threads will form the parallel section. So you would probably have to set num_threads manually, or use omp_get_max_threads() and handle the case that the number of threads created is less than omp_get_max_threads() (which is only an upper bound). That last case could be handled by assigning each thread several chunks (using #pragma omp for should do that):

int max_threads = omp_get_max_threads();
std::vector<std::pair<std::list<...>::iterator, std::list<...>::iterator> > chunks;
chunks.reserve(max_threads); 
size_t chunk_size= list.size() / max_threads;
auto cur_iter = list.begin();
for(int i = 0; i < max_threads - 1; ++i)
{
   auto last_iter = cur_iter;
   std::advance(cur_iter, chunk_size);
   chunks.push_back(std::make_pair(last_iter, cur_iter));
}
chunks.push_back(std::make_pair(cur_iter, list.end()));

#pragma omp parallel shared(chunks)
{
  #pragma omp for
  for(int i = 0; i < max_threads; ++i)
    for(auto it = chunks[i].first; it != chunks[i].second; ++it)
      it->process();
}

This will take only three passes over the list (two, if you can get the size of the list without iterating; std::list::size() is guaranteed O(1) only from C++11 on). I think that is about the best you can do for non-random-access iterators without using tasks or iterating over some out-of-place data structure (like a vector of pointers).

4
votes

I doubt it is possible, since you can't just jump into the middle of a list without traversing it. Lists are not stored in contiguous memory, and std::list iterators are not random access; they are only bidirectional.
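A minimal sketch of the difference (illustration only; v and l are stand-ins):

#include <iterator>
#include <list>
#include <vector>

int main() {
  std::vector<int> v(1000);
  std::list<int>   l(1000);

  auto vmid = v.begin() + 500; // random access: O(1)
  auto lmid = l.begin();
  std::advance(lmid, 500);     // bidirectional: walks 500 nodes, O(n)
  // l.begin() + 500;          // does not compile for std::list

  (void)vmid; (void)lmid;
  return 0;
}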

2
votes

http://openmp.org/forum/viewtopic.php?f=3&t=51

#pragma omp parallel
{
   for(auto it = list1.begin(); it != list1.end(); ++it) // it declared inside the region, so it is private
   {
      #pragma omp single nowait
      {
         it->compute();
      }
   } // end for
} // end omp parallel

This can be understood as if the loop were unrolled like this:

{
  it = list1.begin();
  #pragma omp single nowait
  {
    it->compute();
  }
  it++;
  #pragma omp single nowait
  {
    it->compute();
  }
  it++;
...
}

Given a test program like this:

#include <cstdio>
#include <vector>
#include <omp.h>

int main()
{
    std::vector<int> l(4, 0);

    #pragma omp parallel for
    for (int i = 0; i < (int)l.size(); ++i) {
        printf("th %d = %d \n", omp_get_thread_num(), l[i] = i);
    }
    printf("\n");

    #pragma omp parallel
    {
        for (auto i = l.begin(); i != l.end(); ++i) {
            #pragma omp single nowait
            {
                printf("th %d = %d \n", omp_get_thread_num(), *i);
            }
        }
    }
    return 0;
}

With export OMP_NUM_THREADS=4, the output is as follows (note that in the second section a thread number can repeat):

th 2 = 2 
th 1 = 1 
th 0 = 0 
th 3 = 3 

th 2 = 0 
th 1 = 1 
th 2 = 2 
th 3 = 3

0
votes

Without using OpenMP 3.0, you have the option of having all threads iterate over the list:

std::list<T>::iterator it;
#pragma omp parallel private(it)
{
   for(it = list1.begin(); it!= list1.end(); it++)
   {
      #pragma omp single nowait
      {
         it->compute();
      }
   } 
} 

In this case each thread has its own copy of the iterator (private), but only a single thread will access each specific element (single), while the other threads move forward to the next items (nowait).

Or you can loop once to build a vector of pointers, which is then distributed among the threads:

std::vector< T*> items;

items.reserve(list.size());
//put the pointers in the vector
std::transform(list.begin(), list.end(), std::back_inserter(items), 
               [](T& n){ return &n; }
);

#pragma omp parallel for
for (int i = 0; i < (int)items.size(); i++)
{
  items[i]->compute();
}

Depending on your specific case one or the other can be faster. Testing which one suits you better is easy.
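For example, a minimal timing harness could compare both with omp_get_wtime() (a sketch; the element type T, compute(), and the list size are stand-ins):

#include <cstdio>
#include <list>
#include <vector>
#include <omp.h>

struct T { int v; void compute() { v *= 2; } };

int main() {
  std::list<T> list1(100000, T{1});

  double t0 = omp_get_wtime();
  // variant 1: every thread walks the list, single nowait distributes the work
  #pragma omp parallel
  {
    for (auto it = list1.begin(); it != list1.end(); ++it) {
      #pragma omp single nowait
      {
        it->compute();
      }
    }
  }
  double t1 = omp_get_wtime();

  // variant 2: collect pointers once, then use a plain parallel for
  std::vector<T*> items;
  items.reserve(list1.size());
  for (auto& n : list1) items.push_back(&n);
  #pragma omp parallel for
  for (int i = 0; i < (int)items.size(); i++)
    items[i]->compute();
  double t2 = omp_get_wtime();

  printf("single nowait: %f s, pointer vector: %f s\n", t1 - t0, t2 - t1);
  return 0;
}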

0
votes

Here is a solution which allows inserting and removing elements of a list in parallel.

For a list with N elements, we first cut it into nthreads lists with roughly N/nthreads elements each. In a parallel region this can be done like this:

int ithread = omp_get_thread_num();
int nthreads = omp_get_num_threads();
int t0 = (ithread+0)*N/nthreads;
int t1 = (ithread+1)*N/nthreads;

std::list<int> l2;
#pragma omp for ordered schedule(static)
for(int i=0; i<nthreads; i++) {
    #pragma omp ordered
    {
        auto it0 = l.begin(), it1 = it0;
        std::advance(it1, t1-t0);       
        l2.splice(l2.begin(), l, it0, it1); // source list is l, which owns it0 and it1
    }
}

Here l2 is the cut list for each thread.

Then we can act on each list in parallel. For example, we can insert -1 at every fifth position of the list like this:

auto it = l2.begin();
int pos = t0; // global index of the element *it points to
for(int i=(t0+4)/5; i<(t1+4)/5; i++) {
    std::advance(it, 5*i-pos); // step forward to global position 5*i
    pos = 5*i;
    l2.insert(it, -1);
}

Finally, after we are done operating on the lists in parallel, we splice each thread's list back into one list, in order, like this:

#pragma omp for ordered schedule(static)
for(int i=0; i<nthreads; i++) {
    #pragma omp ordered
    l.splice(l.end(), l2, l2.begin(), l2.end()); // source list is l2, which owns the iterators
}

The algorithm is essentially:

  1. Fast-forward through the list sequentially, making the cut lists.
  2. Act on the cut lists in parallel, adding, modifying, or removing elements.
  3. Splice the modified cut lists back together sequentially.

Here is a working example

#include <algorithm>
#include <iostream>
#include <iterator>
#include <list>
#include <omp.h>

int main(void) {
  std::list<int> l;
  for(int i=0; i<22; i++) {
    l.push_back(i);
  }
  for (auto it = l.begin(); it != l.end(); ++it) {
    std::cout << *it << " ";
  } std::cout << std::endl;

  int N = l.size();
  #pragma omp parallel
  {
    int ithread = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    int t0 = (ithread+0)*N/nthreads;
    int t1 = (ithread+1)*N/nthreads;

    //cut list into nthreads lists with size=N/nthreads
    std::list<int> l2;
    #pragma omp for ordered schedule(static)
    for(int i=0; i<nthreads; i++) {
      #pragma omp ordered
      {
        auto it0 = l.begin(), it1 = it0;
        std::advance(it1, t1-t0);
        l2.splice(l2.begin(), l, it0, it1); // source list is l, which owns it0 and it1
      }
    }
    //insert -1 at every 5th position
    auto it = l2.begin();
    int pos = t0; // global index of the element *it points to
    for(int i=(t0+4)/5; i<(t1+4)/5; i++) {
      std::advance(it, 5*i-pos); // step forward to global position 5*i
      pos = 5*i;
      l2.insert(it, -1);
    }

    //splice lists in order back together.
    #pragma omp for ordered schedule(static)
    for(int i=0; i<nthreads; i++) {
      #pragma omp ordered
      l.splice(l.end(), l2, l2.begin(), l2.end()); // source list is l2, which owns the iterators
    }  
  }

  for (auto it = l.begin(); it != l.end(); ++it) {
    std::cout << *it << " ";
  } std::cout << std::endl;  
}

Result

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 
-1 0 1 2 3 4 -1 5 6 7 8 9 -1 10 11 12 13 14 -1 15 16 17 18 19 -1 20 21