
I am having difficulty loading packages into R on my compute pool nodes using the Azure Batch Python API. The code I am using is similar to what is provided in the Azure Batch Python SDK tutorial, except the task is more complicated: I want each node in the job pool to execute an R script that requires certain package dependencies.

Hence, in my start task commands below, I have each node (Canonical UbuntuServer, SKU: 16) install R via apt and install the R package dependencies. (The reason I moved the R package installation into the start task is that, even after creating a lib directory ~/Rpkgs with universal permissions, running install.packages(list_of_packages, lib="~/Rpkgs/", repos="http://cran.r-project.org") in the task script leads to "not writable" errors.)

task_commands = [
    'cp -p {} $AZ_BATCH_NODE_SHARED_DIR'.format(_R_TASK_SCRIPT),
    # Install pip
    'curl -fSsL https://bootstrap.pypa.io/get-pip.py | python',
    # Install the azure-storage module so that the task script can access Azure Blob storage, pre-cryptography version
    'pip install azure-storage==0.32.0',
    # Install R
    'sudo apt -y install r-base-core',
    'mkdir ~/Rpkgs/',
    'sudo chown _azbatch:_azbatchgrp ~/Rpkgs/',
    'sudo chmod 777 ~/Rpkgs/',
    # Install R package dependencies
    # *NOTE*: the doubled backslashes below are necessary because the string
    # passes through both Python string escaping and the node's shell, each of
    # which consumes one level of backslashes
    'printf "install.packages( c(\\"foreach\\", \\"iterators\\", \\"optparse\\", \\"glmnet\\", \\"doMC\\"), lib=\\"~/Rpkgs/\\", repos=\\"https://cran.cnr.berkeley.edu\\")\n" > ~/startTask.txt',
    'R < ~/startTask.txt --no-save'
    ]
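For context, these commands get joined into a single command line for the pool's start task. Below is a minimal sketch modeled on the `wrap_commands_in_shell` helper used in the Azure Batch Python samples; the helper name and exact quoting here are assumptions, not the SDK's own API:

```python
def wrap_commands_in_shell(commands):
    """Join a list of shell commands into one /bin/bash -c invocation
    suitable for a Batch start task on a Linux node."""
    # 'set -e' makes the start task fail fast on the first broken command,
    # so a failed package install shows up in the node's start task state
    # instead of silently producing a half-configured node.
    return "/bin/bash -c 'set -e; set -o pipefail; {}; wait'".format(
        "; ".join(commands))

print(wrap_commands_in_shell(["sudo apt -y install r-base-core",
                              "mkdir ~/Rpkgs/"]))
```

The resulting string is what would be passed as the start task's command line when the pool is created.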

Anyhow, I confirmed in the Azure portal that these packages were installed as intended on the compute pool nodes (you can see them at startup/wd/Rpkgs/, a.k.a. ~/Rpkgs/, in the node filesystem). However, while the _R_TASK_SCRIPT task was successfully added to the job, it terminated with a non-zero exit code because it wasn't able to load any of the packages (e.g. foreach, iterators, optparse, etc.) that had been installed by the start task.

More specifically, the _R_TASK_SCRIPT contained the following R code and returned the following output:

R code:

lapply( c("iterators", "foreach", "optparse", "glmnet", "doMC"), require, character.only=TRUE, lib.loc="~/Rpkgs/")
...

R stderr, stderr.txt on Azure Batch node:

Loading required package: iterators
Loading required package: foreach
Loading required package: optparse
Loading required package: glmnet
Loading required package: doMC

R stdout, stdout.txt on Azure Batch node:

[[1]]
[1] FALSE

[[2]]
[1] FALSE

[[3]]
[1] FALSE

[[4]]
[1] FALSE

[[5]]
[1] FALSE

FALSE above indicates that it was not able to load the R package. This is the issue I'm facing, and I'd like to figure out why.

It may be noteworthy that, when I spin up a comparable VM (Canonical UbuntuServer SKU: 16) and run the same installation manually, it successfully loads all packages.

myusername@rnode:~$ pwd
/home/myusername
myusername@rnode:~$ mkdir ~/Rpkgs/
myusername@rnode:~$ printf "install.packages( c(\"foreach\", \"iterators\", \"optparse\", \"glmnet\", \"doMC\"), lib=\"~/Rpkgs/\", repos=\"http://cran.r-project.org\")\n" > ~/startTask.txt
myusername@rnode:~$ R < startTask.txt --no-save
myusername@rnode:~$ R

R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
...
> lapply( c("iterators", "foreach", "optparse", "glmnet", "doMC"), require, character.only=TRUE, lib.loc="~/Rpkgs/")
Loading required package: iterators
Loading required package: foreach
...
Loading required package: optparse
Loading required package: glmnet
Loading required package: Matrix
Loaded glmnet 2.0-10

Loading required package: doMC
Loading required package: parallel
[[1]]
[1] TRUE

[[2]]
[1] TRUE

[[3]]
[1] TRUE

[[4]]
[1] TRUE

[[5]]
[1] TRUE

Thanks in advance for your help and suggestions.


2 Answers

0
votes

Each task runs in its own working directory, referenced by the environment variable $AZ_BATCH_TASK_WORKING_DIR. When the R session runs, R's current working directory (getwd()) will be $AZ_BATCH_TASK_WORKING_DIR, not $AZ_BATCH_NODE_STARTUP_DIR, which is where the packages live.
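To make the mismatch concrete, here is a small sketch using purely illustrative directory values (on a real node both variables are set by the Batch agent, and the actual paths will differ):

```python
import os

# Illustrative values only -- on a real node the Batch agent sets these.
os.environ["AZ_BATCH_NODE_STARTUP_DIR"] = "/mnt/batch/tasks/startup"
os.environ["AZ_BATCH_TASK_WORKING_DIR"] = \
    "/mnt/batch/tasks/workitems/job-1/mytask/wd"

# The start task installed the R packages under its own working directory:
pkg_dir = os.environ["AZ_BATCH_NODE_STARTUP_DIR"] + "/wd/Rpkgs"
print("packages live in:", pkg_dir)

# ...but an R session launched by a regular task starts in the task's own
# working directory, so a lib.loc like ~/Rpkgs resolves to the wrong place:
print("R getwd() is:   ", os.environ["AZ_BATCH_TASK_WORKING_DIR"])
```

This is why the lapply in the question returns FALSE for every package even though the start task installed them successfully.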

To point at the exact package location ("startup/wd/Rpkgs") in the R code:

lapply( c("iterators", "foreach", "optparse", "glmnet", "doMC"), require,
        character.only=TRUE,
        lib.loc=paste0(Sys.getenv("AZ_BATCH_NODE_STARTUP_DIR"), "/wd/", "Rpkgs") )

Alternatively, run this before the lapply:

setwd(paste0(Sys.getenv("AZ_BATCH_NODE_STARTUP_DIR"), "/wd/"))

Added: You can also create a Batch pool of Azure Data Science Virtual Machines, which come with R preinstalled, so you don't have to install it yourself.

Azure Batch also has the doAzureParallel R package, which supports package installation. Here's a link: https://github.com/Azure/doAzureParallel (Disclaimer: I created the doAzureParallel R package.)

0
votes

This seems to be caused by the installed packages not being on R's default library search path. Try setting the path of the library trees within which packages are looked for by adding .libPaths("~/Rpkgs") before loading the packages.

For reference, see the SO thread Changing R default library path using .libPaths in Rprofile.site fails to work.

Meanwhile, an official blog post introduces how to run R workloads on Azure Batch, albeit for a Windows environment. Hope it helps.