Let S(U, N) denote the speedup obtained on system U, where the baseline program (the numerator in the speedup formula) uses 1 thread and the improved program uses N threads. That is:
S(U, N) = TimeU(1) / TimeU(N)
Hence:
S(Xeon, 8) > S(Ryzen, 8)
This implies that:
TimeXeon(1) / TimeXeon(8) > TimeRyzen(1) / TimeRyzen(8)
But we cannot conclude anything about how an execution time on Xeon relates to an execution time on Ryzen. We can only say that Xeon has scaled better (i.e., the program was able to make more effective use of the additional resources on Xeon than on Ryzen), but that doesn't mean that it has performed better in terms of execution time. Any such conclusion would be mathematically invalid, because the baselines TimeXeon(1) and TimeRyzen(1) are different. For example, we can conclude neither TimeXeon(8) < TimeRyzen(8) nor TimeXeon(8) > TimeRyzen(8).
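A small numeric counterexample makes this concrete. The timings below are made up purely for illustration (they are not measurements of any real Xeon or Ryzen), and the helper name `speedup` is hypothetical:

```python
# Hypothetical execution times in seconds, keyed by thread count.
# These numbers are invented for illustration only.
time_xeon = {1: 100.0, 8: 20.0}
time_ryzen = {1: 40.0, 8: 10.0}


def speedup(times, n):
    """S(U, N) = TimeU(1) / TimeU(N) for one system's timing table."""
    return times[1] / times[n]


s_xeon = speedup(time_xeon, 8)    # 100 / 20 = 5.0
s_ryzen = speedup(time_ryzen, 8)  # 40 / 10 = 4.0

print("S(Xeon, 8)  =", s_xeon)
print("S(Ryzen, 8) =", s_ryzen)
print("Xeon scaled better:      ", s_xeon > s_ryzen)
print("Xeon finished 8T sooner: ", time_xeon[8] < time_ryzen[8])
```

With these assumed numbers Xeon has the larger speedup, yet its 8-thread run is twice as slow as Ryzen's, so the speedup ordering alone says nothing about the ordering of the execution times.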
However, we can observe that:
S(Xeon, 8) > S(Xeon, 4)
That is:
TimeXeon(1) / TimeXeon(8) > TimeXeon(1) / TimeXeon(4)
The two TimeXeon(1) terms cancel each other, and because execution times are positive, taking reciprocals flips the inequality, so we get:
TimeXeon(4) > TimeXeon(8)
Now here comes the critical observation. Why were we able to deduce from two given speedups how two execution times are related on the same CPU, but not on two different CPUs? Because on the same CPU, the baseline is the same in both speedups, which enabled us to cancel the two baseline terms against each other.
So how can we make the same deduction on two different CPUs? By using a shared baseline or reference system. Typically, some old system is chosen as the baseline. For example, here you could choose Willamette, which is a Pentium 4 processor released in 2000. Of course, you need to choose a system that you can actually run experiments on, so that you can measure the baseline execution time. The speedup can then be calculated as follows:
Sref(U, N) = TimeWillamette(1) / TimeU(N)
Essentially, TimeWillamette(1) becomes the shared term. This formula is much more useful than the previous one. For example, you can easily calculate S(U, N) given only the Sref values for system U, because the TimeWillamette(1) terms cancel in the ratio and leave TimeU(1) / TimeU(N):
S(U, N) = Sref(U, N) / Sref(U, 1)
So if Sref(Xeon, 8) > Sref(Xeon, 4), then it's mathematically valid to deduce that TimeXeon(8) < TimeXeon(4). Also if Sref(Xeon, 8) > Sref(Ryzen, 8), then it's mathematically valid to deduce that TimeXeon(8) < TimeRyzen(8). A given relation between two Sref(U, N) speedups on the same or different CPUs contains more information compared to using S(U, N).
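The relationship between the two kinds of speedup can be sketched in a few lines. Everything here is an assumption for illustration: the reference time `TIME_REF`, the timing table `times`, and the helper names `s_ref` and `s` are hypothetical, and the numbers are not real measurements.

```python
import math

# Hypothetical single-thread time on the shared reference system, in seconds.
TIME_REF = 400.0

# Hypothetical execution times in seconds, keyed by system and thread count.
times = {
    "Xeon": {1: 100.0, 4: 30.0, 8: 20.0},
    "Ryzen": {1: 40.0, 4: 14.0, 8: 10.0},
}


def s_ref(system, n):
    """Sref(U, N) = TimeRef(1) / TimeU(N): speedup over the shared baseline."""
    return TIME_REF / times[system][n]


def s(system, n):
    """S(U, N) recovered from reference speedups: Sref(U, N) / Sref(U, 1)."""
    return s_ref(system, n) / s_ref(system, 1)


# The recovered per-system speedup matches the direct definition TimeU(1) / TimeU(N).
for system in times:
    direct = times[system][1] / times[system][8]
    assert math.isclose(s(system, 8), direct)

# A larger Sref always means a smaller execution time, even across systems,
# because every Sref value shares the same numerator.
faster = max(times, key=lambda system: s_ref(system, 8))
print("Highest Sref at 8 threads:", faster)
print("Its 8-thread time:", times[faster][8])
```

Because every `s_ref` value divides the same `TIME_REF`, sorting systems by `s_ref` is equivalent to sorting them by execution time in reverse, which is exactly the property that plain S(U, N) lacks.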
The SPEC CPU benchmark suite uses this method to normalize performance metrics. The SPEC CPU 2006 suite uses a machine from 1997:
SPEC uses a historical Sun system, the "Ultra Enterprise 2" which was
introduced in 1997, as the reference machine. The reference machine
uses a 296 MHz UltraSPARC II processor, as did the reference machine
for CPU2000. But the reference machines for the two suites are not
identical: the CPU2006 reference machine has substantially better
caches, and the CPU2000 reference machine could not have held enough
memory to run CPU2006.
The SPEC CPU 2017 suite uses a more modern machine from 2006:
The reference machine is a historical Sun Microsystems server, the Sun
Fire V490 with 2100 MHz UltraSPARC-IV+ chips. The UltraSPARC-IV+ was
introduced in 2006, and is newer than the chip used in the CPU2000 and
CPU2006 reference machines (the 300 MHz 1997 UltraSPARC II).
The normalized numbers can be compared against each other whether they are from the same system or different systems.
So the reference system should be the most modern system that is still older (and, in particular, slower) than all the systems of interest (i.e., those that may be compared against each other).