1 vote

I'm using OpenMP for parallel computing. I use C++ vectors whose size is usually on the order of 1*10^5 elements. During the iterative process, I need to re-initialize a number of these large vectors (global scope, not thread-private) to an initial value. Which is the faster way to do this: #pragma omp for or #pragma omp single?
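Roughly, the two variants I am deciding between look like this (a simplified sketch; the real code has several such vectors and a different initial value):

    #include <cstddef>
    #include <vector>

    std::vector<double> data(100000);  // global, shared between threads

    void variant_single() {
        #pragma omp parallel
        {
            // ... iteration work ...
            #pragma omp single
            for (std::size_t i = 0; i < data.size(); ++i)
                data[i] = 0.0;  // one thread re-initializes everything
        }
    }

    void variant_for() {
        #pragma omp parallel
        {
            // ... iteration work ...
            #pragma omp for
            for (std::size_t i = 0; i < data.size(); ++i)
                data[i] = 0.0;  // iterations are split across the team
        }
    }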


2 Answers

0 votes

The general answer has to be "it depends, you have to measure", since initialization in C++ can be, depending on the type, trivial or very expensive. You did not provide a lot of detail, so one has to guess a bit.
If a class has a computationally expensive constructor, parallelizing the work may very well be worth it, as in the sketch below.
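For illustration, a hypothetical element type along these lines (made up purely for this example) would be compute-bound rather than bandwidth-bound, so a parallel loop could pay off:

    #include <cstddef>
    #include <vector>

    // Hypothetical type whose construction dominates the memory traffic.
    struct Expensive {
        double value = 0.0;
        Expensive() {
            for (int i = 0; i < 10000; ++i)  // stand-in for real work
                value += 1e-9 * i;
        }
    };

    void reinit(std::vector<Expensive>& v) {
        #pragma omp parallel for
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i] = Expensive();  // construction cost dwarfs the write itself
    }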

Your specific wording, "initialize to a value", suggests that your vectors hold PODs (say, integers). I will assume that this is the case.

Assuming this, parallelizing will almost certainly not be any faster. This operation is bound by memory bandwidth, and a single CPU thread should be able to saturate the memory bandwidth to approximately 99%.

Parallelizing may, however, very well be slower, for several reasons which I will not elaborate on here; suffice it to say that it is unlikely to be faster.

1 vote

Assuming simple initialization of primitive data types, the initialization itself will be bound by memory or cache bandwidth. However, on modern systems you must use multiple threads to fully utilize both your memory and your cache bandwidth. For example, take a look at these benchmark results, where the first two rows compare parallel versus single-threaded cache bandwidth, and the last two rows parallel versus single-threaded main-memory bandwidth. On high-performance-oriented systems, especially ones with multiple sockets, multiple threads are essential to exploit the available bandwidth.

However, the performance of the re-initialization is not the only thing you should care about. Assuming, for instance, double-precision floating-point numbers, 10^5 elements equal 800 kB of memory, which fits into caches. To improve overall performance, you should try to ensure that after initialization the data is in a cache close to the core that later accesses it. On a NUMA system (multiple sockets, each with faster access to its local memory), this is even more important.
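One way to achieve that, sketched here under the assumption that the later computation uses the same static schedule as the initialization, is to let each thread first-touch exactly the block of elements it will work on afterwards:

    #include <cstddef>
    #include <vector>

    void iterate(std::vector<double>& v) {
        #pragma omp parallel
        {
            // Each thread initializes its own contiguous block, so the data
            // lands in that core's cache and, on NUMA, in its local memory.
            #pragma omp for schedule(static)
            for (std::size_t i = 0; i < v.size(); ++i)
                v[i] = 0.0;

            // schedule(static) maps the same block to the same thread again.
            #pragma omp for schedule(static)
            for (std::size_t i = 0; i < v.size(); ++i)
                v[i] = 2.0 * v[i] + 1.0;  // stand-in for the real computation
        }
    }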

If you do initialize shared memory concurrently, make sure that different cores do not write to the same cache line, and try to keep the access pattern regular so as not to confuse the prefetchers and other clever magic of the CPU.
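As a hypothetical illustration of the cache-line point, compare two schedules for the same write loop:

    #include <cstddef>
    #include <vector>

    void init_interleaved(std::vector<double>& v) {
        // schedule(static, 1): neighbouring elements, which share a cache
        // line, are written by different threads, causing false sharing
        // across the whole array.
        #pragma omp parallel for schedule(static, 1)
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i] = 0.0;
    }

    void init_blocked(std::vector<double>& v) {
        // schedule(static): each thread writes one contiguous block, so cache
        // lines are shared between cores only at the few block boundaries.
        #pragma omp parallel for schedule(static)
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i] = 0.0;
    }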

The general recommendation is: start with a simple implementation and later analyze your application to understand where the bottleneck actually is. Do not invest in complex, hard-to-maintain, system-specific optimizations that may only affect a tiny fraction of your code's overall runtime. If it turns out that this is a bottleneck for your application and your hardware resources are not well utilized, then you need to understand the performance characteristics of your underlying hardware (local/shared caches, NUMA, prefetchers) and tune your code accordingly.