R Parallel Processing with Xeon Phi, minimal code changes?

Question

I'm looking at buying a couple of Xeon Phi 5110P coprocessors, but I'm trying to estimate how much code I would have to change and what other software I would need.

Currently I make good use of R on a multi-core Windows machine (24 cores) by using the foreach package, passing it other packages (forecast, glmnet, etc.) to do my parallel processing.

Having a Xeon Phi, I understand I would want to compile R as described here: https://software.intel.com/en-us/articles/running-r-with-support-for-intel-xeon-phi-coprocessors. I also understand this could be done with a trial version of Parallel Studio XE.

Do I then need to edit R's Makeconf file, adding the C/C++ flags for the Phi? Do I need to compile all the needed packages before the Parallel Studio trial expires? Or do I not need to edit Makeconf at all to get the benefits of foreach on the Phi?

It seems like some of this will be handled automatically once R is compiled, with offloading done by the Math Kernel Library (MKL), but I'm not totally sure of this.

Somewhat related question: is the Intel Xeon Phi usable without the costly Intel compiler?

revolutionanalytics.com also seems to have a few related blog posts, but they aren't entirely conclusive for me: http://blog.revolutionanalytics.com/2015/05/behold-the-power-of-parallel.html

Do you have any updates on your progress, Zachary? I've been allocated project time on a cluster with Xeon Phi coprocessors (R has been installed properly), but I'm struggling to work out how to proceed. I've also been using forecast and foreach to loop through thousands of time series and perform rolling-origin forecasting for a range of methods. My uncertainty is around whether to use native mode or offloading. The individual calculations within the inner loop are not especially heavy, so I don't know whether I can properly parallelize over the 200+ threads: can R run in each thread? — Matt Weller
Hi Matt. No solution yet; there have been internal delays in getting the hardware. I did reach out to Rich Calaway, one of the authors of foreach. He hasn't tried it himself, but relayed that another person had, by setting up the MKL environment variables appropriately and letting BLAS and LAPACK do automatic offloading. With native mode, I would worry about the overhead of transferring all the data to the Phi, and about possible memory limits if the data sets are particularly large. If you get a chance, let me know how things work out. — Zachary
Struggling here, @Zachary. It's not my box, so the sysadmin installed R (following the Intel directions), but installing the forecast package to a local library with install.packages fails during compilation. I'm tempted to just go for it on an AWS box with RRO! Keep me posted on how you get on. — Matt Weller

terry leitch · Accepted Answer · 2016-12-29T17:13:33

If all you need is matrix operations, you can compile R with the MKL libraries as described here: [Running R with Support for Intel® Xeon Phi™ Coprocessors][1], which requires the Intel compiler. Microsoft R comes precompiled with MKL, but I was not able to use automatic offload with it; I had to compile R with the Intel compiler for it to work properly.
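For reference, a rough sketch of such a build. It follows the MKL example in the *R Installation and Administration* manual rather than the Intel article itself, and the `MKLROOT` default, the compiler names (`icc`, `icpc`, `ifort`) and the flags are assumptions to adapt to your Parallel Studio install. Note that `configure` records the chosen compilers and flags in `R_HOME/etc/Makeconf`, and `R CMD INSTALL` reuses them for packages, so Makeconf normally doesn't need hand editing; packages with compiled code will, however, expect those same compilers to be available when they are installed.

```shell
#!/bin/sh
# Sketch: configure R with the Intel compilers and MKL for BLAS/LAPACK.
# Assumptions: Parallel Studio's environment script has been sourced (so the
# compilers are on PATH), and MKLROOT points at the MKL install.
MKLROOT="${MKLROOT:-/opt/intel/mkl}"   # hypothetical default location
MKL_LIB_PATH="$MKLROOT/lib/intel64"

# Threaded MKL via the LP64 interface, as in the R-admin MKL example.
MKL="-L$MKL_LIB_PATH -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread"

# Compilers and flags that configure will record in etc/Makeconf.
export CC=icc CXX=icpc F77=ifort FC=ifort
export CFLAGS="-O2" CXXFLAGS="-O2" FFLAGS="-O2" FCFLAGS="-O2"
export LD_LIBRARY_PATH="$MKL_LIB_PATH${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Only run the build from the top of an unpacked R source tree.
if [ -x ./configure ]; then
  ./configure --with-blas="$MKL" --with-lapack && make && make check
else
  echo "No ./configure here; would run: ./configure --with-blas=\"$MKL\" --with-lapack"
fi
```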

You could use the trial version of the compiler and build R during the trial period to see if it fits your purpose.
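One way to check during the trial whether an MKL-linked build is really offloading is sketched below, assuming the Xeon Phi card and Intel's offload runtime are installed. `MKL_MIC_ENABLE=1` switches on MKL Automatic Offload and `OFFLOAD_REPORT=2` prints what was sent to the card. The matrix size is an illustrative guess: MKL only offloads sufficiently large calls to routines such as `dgemm`, so the many small fits inside a typical `foreach` loop will most likely stay on the host.

```shell
#!/bin/sh
# Sketch: enable Intel MKL Automatic Offload for an R build linked to MKL.
# Assumes a Xeon Phi card and Intel's offload runtime (MPSS) are installed.
export MKL_MIC_ENABLE=1   # let MKL offload eligible BLAS/LAPACK calls
export OFFLOAD_REPORT=2   # print per-call offload statistics

# Illustrative size only: large products are offload candidates,
# small ones stay on the host.
if command -v Rscript >/dev/null 2>&1; then
  Rscript -e 'set.seed(1); m <- matrix(rnorm(2500 * 2500), 2500); print(system.time(m %*% m))'
else
  echo "Rscript not found; MKL_MIC_ENABLE=$MKL_MIC_ENABLE OFFLOAD_REPORT=$OFFLOAD_REPORT"
fi
```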

If you want to use something like the foreach package by setting up a cluster, since each node is its own Linux computer, I'm afraid you're out of luck. Page 3 of [R-Admin][1] says:

> Cross-building is not possible: installing R builds a minimal version of R and then runs many R scripts to complete the build.

You would have to cross-compile on the Xeon host for the Xeon Phi node with the Intel compiler, and that's just not feasible.

The last way to utilize the Phi is to rewrite your code to call it directly. Rcpp provides an easy interface to C and C++ routines. If you find a C routine that runs well on the Xeon, you can call the nodes from within your code. I have done this with CUDA: Rcpp is a thin layer, there are good examples of how to use it, and if you combine it with examples of calling the Phi card nodes you can probably achieve your goal with less overhead.
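A minimal sketch of the Rcpp side, run from a shell: it writes a small C++ file and loads it with `Rcpp::sourceCpp`. The function `col_sums_sq` is a hypothetical example and runs on the host; whatever would actually target the Phi (offload pragmas, or a library built for the card) is not shown and depends on your toolchain.

```shell
#!/bin/sh
# Sketch: Rcpp as a thin layer between R and a C++ routine.
# col_sums_sq is a hypothetical example function; it runs on the host CPU.
cat > col_sums_sq.cpp <<'EOF'
#include <Rcpp.h>

// Sum of squares of each column; exported to R by Rcpp attributes.
// [[Rcpp::export]]
Rcpp::NumericVector col_sums_sq(Rcpp::NumericMatrix x) {
  Rcpp::NumericVector out(x.ncol());
  for (int j = 0; j < x.ncol(); ++j) {
    double s = 0.0;
    for (int i = 0; i < x.nrow(); ++i) s += x(i, j) * x(i, j);
    out[j] = s;
  }
  return out;
}
EOF

# Compile and call it only if R and the Rcpp package are available;
# the two printed vectors should match.
if command -v Rscript >/dev/null 2>&1 &&
   Rscript -e 'quit(status = if (requireNamespace("Rcpp", quietly = TRUE)) 0 else 1)'; then
  Rscript -e 'Rcpp::sourceCpp("col_sums_sq.cpp"); x <- matrix(as.numeric(1:6), 2); print(col_sums_sq(x)); print(colSums(x^2))'
else
  echo "R or Rcpp not available; wrote col_sums_sq.cpp"
fi
```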

But if all you need is matrix ops, there is no quicker route to supercomputing than a good double-precision NVIDIA card and preloading NVBLAS at R startup.
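A sketch of the NVBLAS route, assuming the CUDA toolkit's `libnvblas.so` is on the loader path. NVBLAS reads a config file named by `NVBLAS_CONFIG_FILE`, must be told which CPU BLAS to fall back on (`NVBLAS_CPU_BLAS_LIB`; the OpenBLAS path below is a hypothetical placeholder), and only intercepts Level-3 BLAS calls such as `dgemm`, which is what R's `%*%` normally ends up calling.

```shell
#!/bin/sh
# Sketch: route R's Level-3 BLAS calls to an NVIDIA GPU via NVBLAS.
# Assumptions: libnvblas.so is on the loader path, and CPU_BLAS points at a
# host BLAS that NVBLAS uses for calls it does not send to the GPU.
CPU_BLAS="${CPU_BLAS:-/usr/lib/libopenblas.so}"   # hypothetical path
CONF="$PWD/nvblas.conf"

cat > "$CONF" <<EOF
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB $CPU_BLAS
NVBLAS_GPU_LIST ALL
EOF

export NVBLAS_CONFIG_FILE="$CONF"

# Preload NVBLAS for this R session only if both R and the library exist.
if command -v Rscript >/dev/null 2>&1 && ldconfig -p 2>/dev/null | grep -q libnvblas; then
  LD_PRELOAD=libnvblas.so Rscript -e 'm <- matrix(rnorm(2000 * 2000), 2000); print(system.time(m %*% m))'
else
  echo "R or libnvblas.so not found; config written to $CONF"
fi
```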
