I have a very simple class that contains a single integer. The class has a single function that increments that integer by one. Basically:
class foo
{
public:
    int bar;
    foo() { bar = 1; }
    void inc() { bar++; }
};
I also have a vector that contains 500 million instances of that class (as objects, not pointers).
I have a thread pool that waits to process that vector, with 16 threads (the number of cores on my machine). When I want to process the data, I make the threads run this function, where start/end are the portions of the vector each thread should process (i.e., the vector divided evenly into [number of threads] sections):
void threadFn(vector<foo> &arr, int startInx, int endInx)
{
    for (int i = startInx; i < endInx; i++)
    {
        foo &f = arr[i];
        f.inc();
    }
}
If I run this function on only 1 thread, it returns in ~750ms (which excludes the time it took to construct the vector). If I run this function on all 16 threads, it returns in ~100ms. So I'm getting 7.5x the speed, which is nice...but I want to know if there's anything I can do to push it further.
I've read a bit about false sharing, but am not exactly sure how to translate that problem into a practical optimization here in order to avoid it.
Basically I'm looking for any ideas that could help this algorithm scale better with the number of cores used. Or is that not possible?
It might be worth noting beforehand that my thread pool implementation is not part of the bottleneck. If I strip the loop out of threadFn, the pool finishes its work in <1ms.
foo class altogether and replace it with a mere int? You can also simply do arr[i].inc(). This may make the code more easily readable. – ForceBru