1 vote

I currently have a program that benefits greatly from multithreading. It starts n threads; each thread does 100M iterations. They all use shared memory, but there is no synchronization at all. It approximates solutions to some equation, and the current benchmarks are:

1 thread:   precision 1   time: 150s
4 threads:  precision 4   time: 150s
16 threads: precision 16  time: 150s
32 threads: precision 32  time: 210s
64 threads: precision 64  time: 420s

(Higher precision is better)

I use an Amazon EC2 'Cluster Compute Eight Extra Large Instance', which has 2 x Intel Xeon E5-2670. As far as I understand, it has 16 real cores, so the program improves linearly up to 16 threads. It also has hyper-threading (two hardware threads per core), and my program gains somewhat from that. Using more than 32 threads obviously gives no improvement.

These benchmarks suggest that access to RAM is not the bottleneck.

I also ran the program on an Intel Xeon E5645, which has 12 real cores. The results are:

1 thread:   precision 1   time: 150s
4 threads:  precision 4   time: 150s
12 threads: precision 12  time: 150s
24 threads: precision 24  time: 220s

precision/(time * thread count) is similar to the Amazon machine, which is not clear to me, because each core in the Xeon E5-2670 should be ~1.5x faster judging by CPU MHz (~1600 vs ~2600) and by the http://www.cpubenchmark.net/cpu_list.php 'Passmark CPU Mark' number adjusted for

  1. Why does using a faster processor not improve single-threaded performance, while increasing the number of threads does?
  2. Is it possible to rent a server with multiple CPUs more powerful than 2 x Intel Xeon E5-2670 that still share RAM, so I can run my program without any changes and get better results?

Update:

13 threads on the Xeon E5645 take 196 seconds.

The algorithm randomly explores a tree with 3500 nodes; the height of the tree is 7. Each node contains 250 doubles, which are also accessed randomly. It is very likely that almost no data stays in cache.

Need to see a complete but minimal example that reproduces the numbers to make any kind of informed comment. Raw MHz alone is a pretty naive performance metric, though. – Flexo

Comparing an EC2 instance with a dedicated computer is going to be misleading. There is virtualization overhead that you cannot take into account; you really have no idea what the physical machine's specs are or how many additional VMs are competing with yours for system resources. This really seems like comparing apples and engine blocks to me. – Gray

2 Answers

5 votes

Specs on the two Intel CPUs you've listed:

E5-2670 - 2.6 GHz minimum [8 active cores]  (3.3 GHz turbo on a single core)
E5645   - 2.4 GHz minimum [6 active cores]  (2.8 GHz turbo on a single core)

So there is at least one important question to ask yourself here:

Why isn't your app faster on a single core? There is a much bigger clock drop scaling up from 1 core to 8 cores on the E5-2670 than there is on the E5645. You shouldn't see a perfectly linear progression from 1 to 16 threads, even if your app has zero inter-thread locks -- all current-gen CPUs lower their clock rate as more threads are added to their workload.

The answer is probably not RAM, at least not in the throughput sense, but it might be the L1/L2 caches. The L1/L2 caches are much more important for application performance than RAM throughput. Modern Intel CPUs are designed around the assumption that L1/L2 cache hit rates will be good (if not great). If the L1/L2 caches are rendered useless by an algorithm that churns through megabytes of memory without a frequent reuse pattern, then the CPU essentially becomes bottlenecked on RAM latency.

RAM Latency is Not RAM Throughput

While RAM throughput is probably plenty to keep up with all your threads over time, RAM latency is not. Reading from RAM costs 80-120 cycles, depending on the CPU clock multiplier. By comparison, reading from L1 costs 3 cycles and from L2 11-12 cycles. Therefore, if some portion of your algorithm always results in a fetch from RAM, that portion will always take a very long time to execute, and approximately the same time on different CPUs, since RAM latency is roughly the same across them. 100 cycles on a Xeon is long enough that even a single stall against RAM can become the dominant hot spot in an algorithm (consider that these chips average 3 instructions per cycle).

I do not know whether this is the actual bottleneck in your application, since I don't know how much data it processes on each iteration or what RAM access patterns it uses. But it is one of the only explanations for a constant-time algorithm across so many thread configurations and across different Xeon CPUs.

(Edit: There is also a shared L3 cache on these Xeon chips, but its helpfulness is pretty limited. L3 access latency is 50-60 cycles -- better than RAM, but not by much. And the chance of hitting L3 is pretty slim if L1/L2 are already ineffective. As mentioned before, these chips are designed with high L1/L2 hit rates in mind: the L3 cache arrangement is built to complement occasional misses from L1/L2, and it does not do well serving data as a primary cache itself.)

0 votes

Two tips:
1) Set the number of threads to the number of cores + 1.
2) CPU speed tells little; the speed and size of the L1/L2 CPU caches matter too, and so does memory. (My quad-core is nominally 20% faster than my dual-core laptop, but in reality, with a single-threaded CPU-heavy application, it is 4-8x faster, thanks to faster memory, CPU design, cache, etc.)

Server processing power is often lower than that of a consumer PC, because servers are designed more for robustness and 24h uptime.