5 votes

What is the difference between a host processor and a coprocessor? Specifically, a Xeon Phi coprocessor versus a Xeon Phi host processor?

I have some performance results on these machines (a parallelized OpenMP diffusion-equation code was being run) which show that the host processor works much faster when the same number of threads is used. I would like to know the differences and relate them to my results.

What is the exact model of Phi in your machines? Or are you asking about execution modes (models) - software.intel.com/en-us/articles/… - named "Offload" / "Coprocessor native" / "Symmetric"? Cores of the host CPU (not the Phi, but a standard Xeon E3/E5) are usually faster than Phi cores on scalar code, but the Phi has a lot of cores and they are capable of executing vectorized code. - osgx
There are no Xeon Phi host processors yet. You have a Xeon host and a Xeon Phi coprocessor. The performance asymmetry for the same number of threads is easily understood if you read the published material on Xeon Phi. There are a few books on this you might want to find online. - Jeff Hammond
@osgx The model is: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz - It seems the runs were related to execution mode. I know the coprocessor run used the coprocessor-native execution mode, but I'm not sure about the host-processor case. Do you think it should be offload mode? - Amir
@Jeff I found this document: link - It looks, as you mentioned, like Xeon Phi coprocessor cores are slower but can be used in larger numbers, right? So what's the reason? Older technology? - Amir
Xeon Phi is based upon Pentium core from 1995 (P54C). It lacks the monster reorder buffer and prefetch capability of modern Xeon cores. In addition, it is single-issue per thread, dual-issue per core (Xeon is something like six-issue now) and runs at a low frequency relative to modern Xeon cores. However, since they are smaller cores running at a lower frequency, one can pack many more into a single die, hence the aggregate performance will be higher for highly concurrent workloads. Plus Xeon Phi is 512b SIMD, which Xeon won't have until Skylake. - Jeff Hammond

2 Answers

5 votes

Just to reiterate what Jeff said in the comments: you have a Xeon host with an attached Xeon Phi coprocessor. The current generation of Xeon Phi (Knights Corner) is only available as a coprocessor, not as a standalone Xeon Phi host (that should arrive with the next generation, Knights Landing).

When you run your program on your host Xeon without offloading, it looks (from this website) like you'll be able to run with up to 16 threads. Note that each of your host cores runs at about 2.2 GHz.

When you run your program in native execution mode on your Xeon Phi coprocessor, you should be able to run with many more threads. The optimal number of threads depends on the model of Xeon Phi you have (some work best with 56, others with 60). But note that each Xeon Phi core (roughly 1.2 GHz) is noticeably weaker than a single Xeon core (roughly 2.2 GHz). The benefit of the many-core Xeon Phi technology is exactly that: you can run across many cores.
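If it helps to see those thread counts concretely, here is a minimal sketch (my own snippet, not your code) that simply reports how many threads the OpenMP runtime gives you. Compiled for the host it should report around 16; cross-compiled for native execution on the Phi (e.g. with the Intel compiler's -mmic flag) and run on the card, it should report well over 200, assuming you haven't overridden OMP_NUM_THREADS:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        /* Every thread enters the parallel region; one of them prints the team size. */
        #pragma omp parallel
        {
            #pragma omp single
            printf("Running with %d threads\n", omp_get_num_threads());
        }
        return 0;
    }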

The last very important thing to consider is that the Xeon Phi has a 512-bit-wide SIMD instruction set, so you can get much better SIMD vectorization on the Xeon Phi coprocessor than on the host. In your case, I believe your Xeon host only has a 256-bit SIMD vector processing unit. Therefore, if you haven't already, you can improve your performance on the Xeon Phi (up to 16x if you're working in single precision) by taking advantage of SIMD vectorization; your Xeon host will only give up to 8x. Just to start you on a Google trek, OpenMP 4.0 allows you to write things like #pragma omp simd to tell the compiler when to vectorize lower-level loops throughout your code. If you really want maximum performance from the Xeon Phi, adding SIMD vectorization is a necessity; a sketch of what that can look like is below.
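For illustration only (I'm assuming a simple 1-D diffusion stencil since your actual kernel isn't shown, and the names u, unew, alpha, and n are made up), the OpenMP 4.0 combined construct lets you ask for both threading and vectorization on the same loop:

    /* Hypothetical 1-D diffusion update, for illustration only.
       "parallel for simd" splits iterations across threads and asks the
       compiler to vectorize each thread's chunk of the loop. */
    void diffusion_step(const double *u, double *unew, int n, double alpha)
    {
        #pragma omp parallel for simd
        for (int i = 1; i < n - 1; i++)
            unew[i] = u[i] + alpha * (u[i-1] - 2.0 * u[i] + u[i+1]);
    }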

So to directly answer your question: comparing the performance results between your Xeon host and your Xeon Phi coprocessor using the same number of cores is useless. We already know that each Xeon Phi core is slower than each Xeon core. If you want a direct comparison, you should compare the results using the maximum number of cores each allows (60 and 16, respectively) while taking maximum advantage of the vector processing unit.

1 vote

If you are talking about the current generation (KNC) and not the next (KNL), these are the definitions.

Host processor: the ~8-core / ~16-thread Xeon that is hosting the coprocessor, i.e. the Xeon to which the coprocessor is connected via the PCIe bus.

Coprocessor: the ~60-core / ~240-thread coprocessor that hangs off your Xeon host on the Xeon's PCIe bus.

The host farms out highly parallel / vectorizable jobs to the coprocessor, either by using offload directives or by running them natively on the card under some distributed programming paradigm such as MPI.
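As a rough sketch of the offload model (not your code; the 1-D kernel and the names u, unew, n, and alpha are invented, and this uses the Intel compiler's offload pragma that shipped with KNC-era toolchains), the host copies the input across the PCIe bus, the loop runs on the card, and the result is copied back:

    /* Illustrative offload of a 1-D diffusion step to the first coprocessor.
       The in()/out() clauses control which data is copied over the PCIe bus. */
    void diffusion_step_offload(double *u, double *unew, int n, double alpha)
    {
        #pragma offload target(mic:0) in(u : length(n)) out(unew : length(n))
        {
            #pragma omp parallel for
            for (int i = 1; i < n - 1; i++)
                unew[i] = u[i] + alpha * (u[i-1] - 2.0 * u[i] + u[i+1]);
        }
    }

In native mode, by contrast, the whole program is cross-compiled for the card and launched there directly, optionally as MPI ranks spread across the host and the coprocessor.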

As to the comment about the next generation host processor, the commenter is referring to the fact that the next generation Xeon Phi (KNL) can be configured either as a coprocessor hanging off the PCIe bus (like the 1st gen Xeon Phi, KNC) or as a normal processor that you plug into a motherboard.