1
votes

From this https://software.intel.com/en-us/videos/purpose-of-the-mic-architecture I understand that applications with complex or numerous random memory access are not well suited for Intel Xeon-phi. This is because the architecture uses 61 cores and 8 memory controllers. In case of L1 and L2 cache misses, it takes up to 100s cycles to fetch the line into memory and get it ready for use by the CPU. Such applications are called latency-bound.

Then, the tutorial mentions that many-core architecture (Xeon-phi coprocessor only) are well suited for highly parallel homogeneous code. Two questions from there:

  1. What is referred to as homogeneous code ?

  2. What are real-world applications which can fully benefit from MIC architecture ?

2
This is too broad of a question for this forum.Jeff Hammond

2 Answers

2
votes

I see the Intel MIC architecture as a "x86 based GPGPU" and if you are familiar with the concept of GPGPU you will find your self familiar with the Intel MIC.


An homogeneous clustering is a system infrastructure with multiple execution unit (i.e. CPUs) all with the same features. For example a multicore system that have four Intel Xeon processors is homogeneous.

An heterogeneous clustering is a system infrastructure with multiple execution unit with different features (i.e. like CPU and GPU). For example my Levono z510 with its Intel i7 Haswell (4 CPUs), its Nvidia GT740M (GPU) and its Intel HD Graphics 4600 (GPU) is an heterogeneous system.

An example of heterogeneous code could be a Video Game.
A video game has a control code, executed by one code of one CPU, that control what the other agents do, its send shaders to execute on the GPUs, physic computation to be performed on others cores or GPUs and so on.
In this example you need to write code that run on the CPU (so it is "CPU aware") and code that run on GPU (so it is "GPU aware"). This is actually done by using different tools, different programming languages and different programming models!

homogeneous code is code that don't need to be aware of n different programming models, one for each different kind of agent. It is just the same programming model, language and tool.
Take a look a this very simple sample code for the MPI library.
The code is all written in C, it is the same program that just take a different flow.


About the applications, Well that's really a broad question...

As said above I see the Intel MIC as a GPGPU based on x86 ISA (part of it at least).

An SDK particularly useful (and listed on the video you linked) to work with clustered systems is OpenCL, it can be used for fast processing of images and computer vision and basically for anything that need the same algorithm to be run billions of times with different inputs (like cryptography applications/brute forcing).
If you search for some OpenCL based project on the web you will get an idea.

To answer you second question it is better to ask ourselves "What could not take advantage of the MIC architecture?" and we will soon find that the more an algorithm is distant from the concept of Stream Processing and the related topics, including the one of Kernel, the less it is suitable for the MIC.

0
votes

First a straight forward answer to your direct question - to get the most out of the coprocessor, your code should be able to use a large number of threads and should vectorize. How many threads? Well, you have 60 cores (+/- depending on which version you get) and 4 threads per core, with a sweet spot around 2 threads per core on many codes. Sometimes you can get good performance even if you don't use every single core. But vectorization is extremely important; the long (512 byte) vectors are a big source of speed on the coprocessor.

Now, on to programming. The Intel Xeon Phi coprocessor uses two different kinds of programming - offload and native.

In the offload model, you write a program, determine which parts of that code have enough parallelism to make use of the large number of cores on the coprocessor and mark those sections with offload directives. Then inside those offloaded sections, you write the code using some form of parallelism, like OpenMP. (Heterogeneous)

In native code, you do not use any offload directives but, instead, use a -mmic compiler directive. Then you run the code directly on the coprocessor. The code you write will use some form of parallelism, like OpenMP, to make use of the large number of cores the coprocessor has. (Homogeneous)

Another variation on these programming models is to use MPI, often in addition to OpenMP. You can use the offload programming model, in which case, the nodes in you MPI system will be the host nodes in your system. (Hybrid) Alternately, you can use the native programming model, in which case you treat the coprocessor as just another node in your system. (Heterogeneous if host and coprocessors are nodes; homogeneous if only coprocessors are use.)

You may have noticed that nothing I have said implies a separate programming style for the host and coprocessor. There are some optimizations you can make that will keep code written for the coprocessor from running on the processor as well but, in general, the code you write for the coprocessor can also be compiled for and run on the host by just changing the compiler options.

As far as real world apps, see https://software.intel.com/en-us/mic-developer/app-catalogs