Performance analysis of CPUs for parallel executions

Question

Recently, I did a "parallel speedup" comparison between two computers with different specifications:

1- Single AMD Ryzen 7 1800X running at 3.6GHz. This CPU has 8 physical cores and 16 logical cores.

2- Dual Xeon 2695 v3 (Haswell) running at 2.3GHz on one motherboard. Each CPU has 14 physical cores, so there are 28 physical cores and 56 logical cores in total.

I ran one program with different numbers of threads on both systems. I know that this may not be a fair comparison, since the program also uses about 4GB of memory and I haven't given the memory specs, but the speedup chart is shown below. Note that for each processor,

speedup = (time of one thread on that CPU) / (time of N threads on that CPU)

Therefore, for 1 thread, both Ryzen and Xeon are scaled to 1.

Someone looking at the chart may say that the speedup of Xeon is better than that of Ryzen. For example, with 8 threads, Ryzen has a speedup of 3.4 while Xeon has a speedup of 4.69.

However, if we check the time data, we see that for 8 threads they have nearly the same execution time (263 vs. 253 seconds). Moreover, with a single thread, Ryzen performs better than Xeon (900 vs. 1188 seconds). It is obvious that

S_ryzen = 900/263        <        S_xeon = 1188/253

So looking only at the speedup data seems misleading. On the other hand, I would expect the 8-thread Ryzen run to have a lower execution time than the Xeon run, e.g. 200 seconds, since Ryzen has better single-core performance.

What can be concluded about the performance comparison of these two processors? I know the Xeon provides more cores, but taking 8 cores (which both have), which processor has higher performance?

How did you measure the thread time? What OS? What does the (apparently not very scalable) program do? Have you tried another program? — rustyx

Hadi Brais · Accepted Answer · 2019-01-11T20:23:01

Let S(U, N) denote the speedup obtained on system U where the baseline program (the numerator in the speedup formula) uses 1 thread and the improved program uses N threads. That is:

S(U, N) = Time_U(1) / Time_U(N)

Hence:

S(Xeon, 8) > S(Ryzen, 8)

This implies that:

Time_Xeon(1) / Time_Xeon(8) > Time_Ryzen(1) / Time_Ryzen(8)

But we cannot conclude anything about how any two execution times are related. We can only say that Xeon has scaled better (i.e., the program was able to make more effective use of the additional resources on Xeon than on Ryzen), but that doesn't mean that it has performed better in terms of execution time. It's just a mathematically invalid conclusion. For example, we can't conclude that Time_Xeon(8) > Time_Ryzen(8).
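
As a quick check with the timings from the question (900 s and 263 s on Ryzen, 1188 s and 253 s on Xeon), the sketch below computes both speedups and then swaps in a made-up 8-thread Xeon time to show that the speedup ordering can stay the same while the ordering of the execution times flips. The function name `speedup` and the 270 s figure are illustrative assumptions, not measurements.

```python
def speedup(time_one_thread, time_n_threads):
    """Self-relative speedup: S(U, N) = Time_U(1) / Time_U(N)."""
    return time_one_thread / time_n_threads

# Timings in seconds, taken from the question.
ryzen_1, ryzen_8 = 900.0, 263.0
xeon_1, xeon_8 = 1188.0, 253.0

s_ryzen = speedup(ryzen_1, ryzen_8)
s_xeon = speedup(xeon_1, xeon_8)

# Hypothetical value (NOT measured): suppose the 8-thread Xeon run took 270 s.
xeon_8_hypothetical = 270.0
s_xeon_hypothetical = speedup(xeon_1, xeon_8_hypothetical)

# Xeon has the larger speedup in both the measured and the hypothetical case ...
speedup_order_same = s_xeon > s_ryzen and s_xeon_hypothetical > s_ryzen

# ... but which system finishes first at 8 threads differs between the cases.
measured_xeon_faster = xeon_8 < ryzen_8
hypothetical_xeon_faster = xeon_8_hypothetical < ryzen_8

print(f"S(Ryzen, 8) = {s_ryzen:.2f}, S(Xeon, 8) = {s_xeon:.2f}")
print(f"S(Xeon, 8) with hypothetical 270 s = {s_xeon_hypothetical:.2f}")
print(f"Xeon faster at 8 threads (measured): {measured_xeon_faster}")
print(f"Xeon faster at 8 threads (hypothetical): {hypothetical_xeon_faster}")
```

In both cases S(Xeon, 8) > S(Ryzen, 8) holds, yet only the execution times themselves tell you which machine finished first.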

However, we can observe that:

S(Xeon, 8) > S(Xeon, 4)

That is:

Time_Xeon(1) / Time_Xeon(8) > Time_Xeon(1) / Time_Xeon(4)

The two Time_Xeon(1) terms cancel each other and we get:

Time_Xeon(4) > Time_Xeon(8)

Now here comes the critical observation. Why were we able to deduce from two given speedups how two execution times are related on the same CPU but not on two different CPUs? Because on the same CPU, the baseline is the same in both speedups, which enabled us to cancel them with each other.

So how can we make the same deduction on two different CPUs? By using a shared baseline or reference system. Typically, some old system is chosen as the baseline. For example, here you could choose Willamette, which is a Pentium 4 processor released in 2000. Of course, you need to choose a system that you can run experiments on to measure the baseline execution time. The speedup can then be calculated as follows:

S_ref(U, N) = Time_Willamette(1) / Time_U(N)

Essentially, Time_Willamette(1) becomes the shared term. This formula is much more useful than the previous one. For example, you can easily calculate S(U, N) given only S_ref(U, N) as follows:

S(U, N) = S_ref(U, N) / S_ref(U, 1)

So if S_ref(Xeon, 8) > S_ref(Xeon, 4), then it's mathematically valid to deduce that Time_Xeon(8) < Time_Xeon(4). Also if S_ref(Xeon, 8) > S_ref(Ryzen, 8), then it's mathematically valid to deduce that Time_Xeon(8) < Time_Ryzen(8). A given relation between two S_ref(U, N) speedups on the same or different CPUs contains more information compared to using S(U, N).
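
The relations above can be sketched numerically. The baseline time of 5000 s for the reference machine is a made-up value (no measurement on a reference system is given here), and the helper names `s_ref` and `s_self` are hypothetical; the Ryzen and Xeon timings are the ones from the question.

```python
# Hypothetical baseline (NOT measured): Time_ref(1) on the shared
# reference machine, in seconds.
TIME_REF_1 = 5000.0

# Times in seconds from the question, keyed by (system, threads).
times = {
    ("Ryzen", 1): 900.0, ("Ryzen", 8): 263.0,
    ("Xeon", 1): 1188.0, ("Xeon", 8): 253.0,
}

def s_ref(system, n):
    """Speedup against the shared baseline: Time_ref(1) / Time_U(N)."""
    return TIME_REF_1 / times[(system, n)]

def s_self(system, n):
    """Self-relative speedup recovered as S_ref(U, N) / S_ref(U, 1)."""
    return s_ref(system, n) / s_ref(system, 1)

# The recovered value agrees with Time_U(1) / Time_U(N) computed directly.
direct = times[("Xeon", 1)] / times[("Xeon", 8)]
recovered = s_self("Xeon", 8)

# With a shared numerator, a larger S_ref means a smaller execution time,
# even across different systems.
xeon_wins_by_s_ref = s_ref("Xeon", 8) > s_ref("Ryzen", 8)
xeon_wins_by_time = times[("Xeon", 8)] < times[("Ryzen", 8)]

for key in sorted(times):
    print(key, f"S_ref = {s_ref(*key):.2f}")
print(f"S(Xeon, 8): direct = {direct:.4f}, recovered = {recovered:.4f}")
print(f"Orderings agree: {xeon_wins_by_s_ref == xeon_wins_by_time}")
```

Changing `TIME_REF_1` rescales every S_ref value by the same factor, so the orderings and the recovered S(U, N) do not depend on the particular baseline chosen.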

The SPEC CPU benchmark suite uses this method to normalize performance metrics. The SPEC CPU 2006 suite uses a machine from 1997:

SPEC uses a historical Sun system, the "Ultra Enterprise 2" which was introduced in 1997, as the reference machine. The reference machine uses a 296 MHz UltraSPARC II processor, as did the reference machine for CPU2000. But the reference machines for the two suites are not identical: the CPU2006 reference machine has substantially better caches, and the CPU2000 reference machine could not have held enough memory to run CPU2006.

The SPEC CPU 2017 suite uses a more modern machine from 2006:

The reference machine is a historical Sun Microsystems server, the Sun Fire V490 with 2100 MHz UltraSPARC-IV+ chips. The UltraSPARC-IV+ was introduced in 2006, and is newer than the chip used in the CPU2000 and CPU2006 reference machines (the 300 MHz 1997 UltraSPARC II).

The normalized numbers can be compared against each other whether they are from the same system or different systems.

So the reference system should be the most modern system that is older (in particular, slower) than all the systems of interest (i.e., those that may be compared against each other).
