I have a very simple class that contains a single integer. The class has a single function that increments that integer by one. Basically:
class foo
{
public:
    int bar;
    foo() { bar = 1; }
    void inc() { bar++; }
};
I also have a vector that contains 500 million instances of that class (as objects, not pointers).
I have a thread pool that waits to process that vector, with 16 threads (the number of cores on my machine). When I want to process the data, I make the threads run this function, where start/end are the portions of the vector each thread should process (i.e., the vector divided evenly into [number of threads] sections):
void threadFn(vector<foo> &arr, int startInx, int endInx)
{
    for (int i = startInx; i < endInx; i++)
    {
        foo &f = arr[i];
        f.inc();
    }
}
If I run this function on only 1 thread, it returns in ~750ms (which excludes the time it took to construct the vector). If I run this function on all 16 threads, it returns in ~100ms. So I'm getting 7.5x the speed, which is nice...but I want to know if there's anything I can do to push it further.
I've read a bit about false sharing, but am not exactly sure how to translate that problem into a practical optimization here in order to avoid it.
Basically I'm looking for any ideas that could help this algorithm scale better with the number of cores used. Or is that not possible?
It might be worth noting beforehand that my thread pool implementation is not part of the bottleneck. If I strip the loop out of threadFn, the pool finishes its work in <1ms.
foo class altogether and replace it with a mere int? You can also simply do arr[i].inc(). This may make the code more easily readable. – ForceBru