I am following the instructions found on R-bloggers to set up Spark on a Red Hat machine. I want to use Spark in RStudio.

I have downloaded spark-1.6.1-bin-hadoop2.6, followed the instructions, and put the following lines in a script in RStudio:

# Setting SPARK_HOME
Sys.setenv(SPARK_HOME = "~/Downloads/spark-1.6.1-bin-hadoop2.6")

# Setting library path
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

library(SparkR)

# create a spark context
sc <- sparkR.init(master = "local")

But the last line returns the following error:

Launching java with spark-submit command ~/Downloads/spark-1.6.1-bin-hadoop2.6/bin/spark-submit sparkr-shell /tmp/RtmpSwsYUW/backend_port3752546940e6

sh: ~/Downloads/spark-1.6.1-bin-hadoop2.6/bin/spark-submit: No such file or directory

I have tried every solution on the internet before asking this. For example:

  • JAVA_HOME and SPARK_HOME are set.

  • Making spark-submit executable with chmod a+x spark-submit.cmd (and also chmod u+w spark-submit.cmd), which did not work (and yes, I was in the correct directory).

  • Tried spark-shell in the terminal and it works (it returns a working Scala shell).

  • Adding this before initialization:

      Sys.setenv("SPARK_SUBMIT_ARGS"=" - - master yarn-client sparkr-shell")
    

The only issue I can think of now is that there is no sparkr-shell in that directory; there are only sparkr.cmd and sparkr2.cmd. Now I am wondering: is this related to the Spark version I downloaded? Should I install Hadoop first?

1 Answer

SparkR invokes Spark through system2(), which quotes the command using shQuote() (see ?system2 and ?shQuote). This means the ~ in your path never gets expanded: sh is handed the literal path and reports "No such file or directory".
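You can see the effect in a plain R session (a minimal illustration; the printed home directory will of course differ on your machine):

# shQuote() wraps the path in single quotes, and sh does not perform
# tilde expansion inside single quotes, hence "No such file or directory".
p <- "~/Downloads/spark-1.6.1-bin-hadoop2.6/bin/spark-submit"
shQuote(p)
# [1] "'~/Downloads/spark-1.6.1-bin-hadoop2.6/bin/spark-submit'"

# path.expand() resolves the ~ inside R, before the shell ever sees it:
path.expand(p)
# [1] "/home/<youruser>/Downloads/spark-1.6.1-bin-hadoop2.6/bin/spark-submit"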

Just specify the full path:

Sys.setenv(SPARK_HOME = "/home/<youruser>/Downloads/spark-1.6.1-bin-hadoop2.6")

Or do the path expansion yourself:

Sys.setenv(SPARK_HOME = path.expand("~/Downloads/spark-1.6.1-bin-hadoop2.6"))
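Either way, it is worth a quick sanity check that the launcher resolves to a real file before calling sparkR.init() (a small sketch, assuming the standard Linux layout of the binary distribution):

# Should print TRUE; if it prints FALSE, SPARK_HOME still does not
# point at the unpacked Spark distribution.
spark_submit <- file.path(Sys.getenv("SPARK_HOME"), "bin", "spark-submit")
file.exists(spark_submit)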

The .cmd files are for Windows, by the way, so they're not relevant.