I am following the instructions found here on R-bloggers to set up Spark on a Red Hat machine; I want to use Spark from RStudio. I downloaded spark-1.6.1-bin-hadoop2.6 and, following the instructions, put the following lines in a script in RStudio:
# Setting SPARK_HOME
Sys.setenv(SPARK_HOME = "~/Downloads/spark-1.6.1-bin-hadoop2.6")
# Setting library path
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
# create a spark context
sc <- sparkR.init(master = "local")
But the last line returns the following error:
Launching java with spark-submit command ~/Downloads/spark-1.6.1-bin-hadoop2.6/bin/spark-submit sparkr-shell /tmp/RtmpSwsYUW/backend_port3752546940e6
sh: ~/Downloads/spark-1.6.1-bin-hadoop2.6/bin/spark-submit: No such file or directory
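One thing I noticed (just a guess on my part): the error from sh shows the literal `~` in the path, and a quoted `~` is not tilde-expanded by sh. I can reproduce a failure of the same shape in a plain shell with a throwaway script (the `tilde-demo` paths below are made up for the demo, not part of my Spark setup):

```shell
# Demo: tilde expansion does not happen inside quotes, so a literal "~"
# embedded in a quoted command string is passed through as-is and fails.
mkdir -p "$HOME/tilde-demo"
printf '#!/bin/sh\necho ok\n' > "$HOME/tilde-demo/hello.sh"
chmod +x "$HOME/tilde-demo/hello.sh"

"$HOME/tilde-demo/hello.sh"        # works: prints "ok"

# Quoting the path suppresses tilde expansion, so sh looks for a file
# literally named "~/tilde-demo/hello.sh" and fails, much like my error.
sh -c '"~/tilde-demo/hello.sh"' || echo "failed as expected"
```

I don't know whether that is what SparkR's launcher is doing internally, but the symptom looks similar.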
I have tried every solution I could find on the internet before asking this. For example:

- JAVA_HOME and SPARK_HOME are set.
- Making spark-submit executable with chmod a+x spark-submit.cmd (and also chmod u+w spark-submit.cmd) did not work (and yes, I was in the correct directory).
- Running spark-shell in a terminal works (it returns a working shell in Scala).
- Adding this before initialization:
  Sys.setenv("SPARK_SUBMIT_ARGS" = "--master yarn-client sparkr-shell")
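For the chmod attempt above, here is the kind of check I ran, shown against a throwaway stub file rather than the real spark-submit (the `spark-demo-bin` path is invented for this sketch):

```shell
# Stub standing in for spark-submit, to show the executable-bit check.
stub="$HOME/spark-demo-bin/spark-submit"
mkdir -p "$(dirname "$stub")"
printf '#!/bin/sh\necho stub\n' > "$stub"
chmod a-x "$stub"                          # start with the bit cleared

test -x "$stub" || echo "not executable"   # prints "not executable"
chmod a+x "$stub"                          # same fix I applied to the real file
test -x "$stub" && echo "executable"       # prints "executable"
```

On the real file, `test -x` succeeded after chmod, yet the error persisted.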
The only issue I can think of now is that there is no sparkr-shell in the directory; there are only sparkr.cmd and sparkr2.cmd. Now I am wondering: is this related to the Spark version I downloaded? Should I install Hadoop first?