3
votes

I just finished installing a desktop computer based on an AMD Ryzen 2700x and 32GB RAM (running Ubuntu 18.04). At work, I have a 3-year-old laptop workstation with an Intel i7-6820HQ and 16GB RAM (running Windows 10).

I installed Anaconda on both platforms and ran custom Python code that relies heavily on basic numpy matrix operations. The code does not involve any GPU-specific computation (my work laptop does not have a GPU). The Ryzen is running at 3.7GHz, the laptop i7 is running at 3.6GHz. Both systems have been fully updated.

To my surprise, the code runs in 5 minutes on my work laptop, while it requires 10 minutes on the Ryzen desktop!

The latest Ryzen 2700x is supposed to be much faster than a high-end 3-year-old laptop Intel processor, so why would it be 2x slower?

  • Is it due to Ubuntu being sub-optimal in some way as opposed to Windows 10 for the Ryzen?

  • Is it due to Intel being better suited to Python simulations than AMD?

  • Anything else?

Thanks for your help in understanding what is going on.

2
This arguably does not belong on stackoverflow. Try superuser (another stackexchange site). In any case, benchmarking is impossible without code against which to benchmark. If you want help, you'll need to provide a reproducible example. - Him
Thank you for your reply. I will try to post to superuser: once done, I will delete the post on stackoverflow to avoid creating a duplicate. I cannot share the code as I use it for my work, but I'll try to find the time to create a simpler test script for sharing. - PythonistL
@Scott if the answer is dependent on the code you've written, I'd argue it belongs here. It's only unfortunate that the code isn't shareable; a simple benchmark test that shows the difference would be very helpful. The lack of code is the only knock I have against this question, and the single answer that exists as I write this is illuminating. - Mark Ransom
@MarkRansom Does it depend on the code? The top answer suggests that this is actually a hardware question. See this meta - Him
@Scott Normally Ryzen and i7 are pretty evenly matched, it requires very specific code to produce a 2x difference. - Mark Ransom

2 Answers

10
votes

It's a software issue: by default, Anaconda comes with Intel's MKL as the backend for BLAS, which purposefully cripples performance on AMD CPUs. You can instead install the non-MKL version of numpy, which uses OpenBLAS, and you'll see a huge performance boost. You don't need to reinstall Anaconda: just uninstall numpy and mkl, then install a numpy built with OpenBLAS.
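Before swapping packages, it can help to confirm which BLAS backend each machine's numpy is actually linked against. The following is a minimal sketch, not an official numpy API: `classify_blas` and `current_blas_backend` are hypothetical helper names, and the string matching is only a heuristic over whatever `numpy.show_config()` prints, whose layout varies between numpy versions.

```python
import contextlib
import io


def classify_blas(config_text: str) -> str:
    """Heuristically guess the BLAS backend from numpy.show_config() text.

    "openblas" is checked first because some older numpy builds print a
    "blas_mkl_info: NOT AVAILABLE" section even when linked to OpenBLAS.
    """
    text = config_text.lower()
    if "openblas" in text:
        return "openblas"
    if "mkl" in text:
        return "mkl"
    return "unknown"


def current_blas_backend() -> str:
    """Capture what numpy.show_config() prints and classify it."""
    import numpy as np  # imported lazily so classify_blas works without numpy

    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return classify_blas(buffer.getvalue())
```

Running `print(current_blas_backend())` on both machines should show whether the Ryzen install is using MKL; if the heuristic returns "unknown", reading the raw `numpy.show_config()` output directly is the safer check.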

4
votes

numpy matrix operations

Intel Skylake has significantly better FMA throughput (2 per clock on 256-bit vectors) than Ryzen (2 per clock on 128-bit vectors, or 1 per clock on 256-bit vectors). See https://agner.org/optimize/ for x86 microarch details, and the Stack Overflow Q&A "FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2" for a summary including Ryzen.

With data hot in cache, which a well-optimized matmul can achieve with cache-blocking, a good matmul can bottleneck on FMA execution unit throughput.

Or on L1d SIMD load/store bandwidth, where Skylake has more than 2x Ryzen's: it can sustain close to 2x 256-bit loads + 1x 256-bit store per clock, while Ryzen can sustain 2x 128-bit cache accesses per clock, up to one of which can be a store.

So it's totally reasonable for the single-threaded or per-core throughput for Intel to be twice that of a Ryzen core, for matmul / FMA throughput.
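As a rough worked example of that ratio, the FMA figures above can be turned into theoretical peak double-precision numbers. This is a back-of-the-envelope sketch: `peak_gflops_per_core` is a hypothetical helper, and it assumes the 3.6 / 3.7 GHz clocks from the question are sustained under vector load, which may not hold in practice.

```python
def peak_gflops_per_core(clock_ghz: float, fma_per_clock: int,
                         vector_bits: int, element_bits: int = 64) -> float:
    """Theoretical peak GFLOPS of one core from its FMA throughput."""
    lanes = vector_bits // element_bits          # elements per vector
    flops_per_cycle = fma_per_clock * lanes * 2  # one FMA = multiply + add
    return clock_ghz * flops_per_cycle


# Skylake: 2 FMAs per clock on 256-bit vectors, at the question's 3.6 GHz.
skylake = peak_gflops_per_core(3.6, fma_per_clock=2, vector_bits=256)
# Ryzen 2700x: 2 FMAs per clock on 128-bit vectors, at the question's 3.7 GHz.
ryzen = peak_gflops_per_core(3.7, fma_per_clock=2, vector_bits=128)

print(f"per core: Skylake {skylake:.1f}, Ryzen {ryzen:.1f} GFLOPS "
      f"(ratio {skylake / ryzen:.2f}x)")
print(f"whole chip: 4 cores x Skylake = {4 * skylake:.1f}, "
      f"8 cores x Ryzen = {8 * ryzen:.1f} GFLOPS")
```

Under these assumptions the per-core peak differs by a bit less than 2x, while the whole-chip peaks come out roughly equal once the 2700x's eight cores are counted against the 6820HQ's four.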


Are you multi-threading to take advantage of all cores in each machine? 2700x is an 8-core CPU, while 6820HQ is a 4-core chip.

If your workload is taking advantage of multiple cores, then maybe it's an L3 cache bandwidth limitation that's making the difference, assuming both machines are configured correctly and actually running at 3.6 / 3.7 GHz. Or maybe something is creating a 4x per-core perf difference.
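Since the original code can't be shared, a small stand-in benchmark would make the comparison reproducible on both machines. The sketch below is an assumption about what such a test could look like, not the asker's workload: `gflops`, `best_time`, and `bench_matmul` are hypothetical names, and a single dense matmul is only a proxy for "basic numpy matrix operations".

```python
import time


def gflops(n: int, seconds: float) -> float:
    """An n x n by n x n matmul performs roughly 2 * n**3 floating-point ops."""
    return 2.0 * n**3 / seconds / 1e9


def best_time(fn, repeats: int = 5) -> float:
    """Return the fastest of several wall-clock timings of fn()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def bench_matmul(n: int = 2000, repeats: int = 5) -> float:
    """Time a double-precision n x n matmul and report achieved GFLOPS."""
    import numpy as np  # imported lazily so the helpers above work without numpy

    rng = np.random.default_rng(0)
    a = rng.random((n, n))
    b = rng.random((n, n))
    a @ b  # warm-up run so one-time setup cost is not timed
    return gflops(n, best_time(lambda: a @ b, repeats))
```

Calling `print(bench_matmul())` on each machine gives a multi-threaded figure. For a per-core number, one option is to set the environment variables `OMP_NUM_THREADS=1`, `MKL_NUM_THREADS=1`, and `OPENBLAS_NUM_THREADS=1` before starting Python, since the BLAS libraries read them at load time.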