0
votes

I'm trying to find out whether I could use an Intel Xeon Phi coprocessor to "parallelize" the following problem:

Say I have 2000 files that need to be processed by a single-threaded executable. For each file, the executable reads it, does its thing and outputs it to a correspoinding output file, then exits.

For instance:

FILES=/path/to/*
for f in $FILES
do
    # take action on each file
    ./executable $f outFileCorrespondingTo_f
done

The tools are not coded for multi-threaded execution, or looping through the files, nor do we wish to change anything in their code for now. They're written in C with some external libraries.

My questions are:

  1. Could this kind of "script-looping" be run on the Xeon Phi's native OS in such a way that it parallelizes the calls to the executable, so they run concurrently on all of its cores? Is it "general-purpose" enough for that?

  2. The files themselves are rather small, so its 8GB memory would be more than enough for storing the data at runtime, but not for keeping all of the output on the device, so I would need to output on the host. So my second quetion is: is this kind of memory exchange possible "externally"?

i.e. not coded into the tool, but managed by the host OS and the device, for every execution of the executable.

  1. If this is possible, could it provide a performance boost in any way, or would the memory and thread allocation bottlenecks be too intensive? Basically each execution takes a few seconds, depending on the length of the input file, but I'm pretty confident this is a few orders of magnitude longer than how much it would take to transfer the file.
2
Regarding performance running multiple processes concurrently is likely to cause a lot of L2 cache contention/thrashing. Good L2 usage is usually very important for getting good performance on KNC. It depends on the workload though so YMMV.amckinley

2 Answers

0
votes

Xeon phi co-processors run a very feature-full version of the Linux operating system, so most of what you are used to on a Linux box is likely to work on Xeon Phi as well.

Now, for your specific issue, I guess that GNU Parallel should just permit you to do what you want in a breath. Simply, you'll have to have your file system mounted on the card so that you can access the files directly, but this is just standard stuff for a Xeon Phi node. And be aware that this will generate some traffic on the PCIe link between the host and the co-processor for the file transfers.

Regarding performance, this is hard to tell: the lower single-threaded performance of Xeon Phi cores along with the transfer times are definitely suggesting a big hit in this domain, but the level of parallelism you can extract from the device might very well overcome this, depending on how compute intensive your workload is. Best answer is for you to give it a try...

0
votes

This is an addition to the answer given by Gilles.

Yes, the Xeon Phi should be able to do what you want at a basic operational level.

Even so, I think it is the wrong platform for your purpose for a few reasons.

  • Each core on the Xeon Phi is a Pentium core. Though it is enhanced (4 threads/core, 512 bit vector engine, etc), it is still a Pentium. That means it runs scalar code as a Pentium. Your task sounds like a whole bunch of serial processes running in parallel. So each process will run as if it is running on a Pentium.
  • To achieve remarkable performance, you need code that parallelizes well (read that as OpenMP, light weight threads, and thread pooling) and also vectorizes (takes advantage of the 512-bit vector engine). Without both of those enhancements, you are running on a Pentium, abet a lot of Pentiums.
  • Moving data across the PCIe bus is slow. If you are transferring a lot of files, this can be even slower though you can reduce the contention a little by hiding latency (depending upon your application). If you are hitting the PCIe bus with 244 file read requests on start up, that's quite a lot of contention. Even in a steady state condition, it sounds like you'll be reading more than 20 files at any given time (and I suspect even more given that we are executing scalar code as a Pentium).

Now the KNL architecture might be more appropriate for your needs, but that isn't out yet.

If you still think the Xeon Phi might be appropriate for what you want to do, you can ask the Xeon Phi Intel forum experts. If your application is proprietary/sensitive, you can ask the Intel experts as a private message.