
For some time I have been using the sparklyr package to connect to my company's Hadoop cluster with the following code:

library(sparklyr)

Sys.setenv(SPARK_HOME="/opt/spark/")
Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop/conf.cloudera.yarn")
Sys.setenv(JAVA_HOME="/usr/lib/jvm/jre")

system('kinit -k -t user.keytab user@xyz')

sc <- spark_connect(master = "yarn",
                    config = list(
                      default = list(
                        spark.submit.deployMode = "client",
                        spark.yarn.keytab = "user.keytab",
                        spark.yarn.principal = "user@xyz",
                        spark.executor.instances = 20,
                        spark.executor.memory = "4G",
                        spark.executor.cores = 4,
                        spark.driver.memory = "8G")))

and everything works fine, but when I try to add the rsparkling package using similar code:

library(h2o)
library(rsparkling)
library(sparklyr)

options(rsparkling.sparklingwater.version = '2.0')

Sys.setenv(SPARK_HOME="/opt/spark/")
Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop/conf.cloudera.yarn")
Sys.setenv(JAVA_HOME="/usr/lib/jvm/jre")

system('kinit -k -t user.keytab user@xyz')

sc <- spark_connect(master = "yarn",
                    config = list(
                      default = list(
                        spark.submit.deployMode = "client",
                        spark.yarn.keytab = "user.keytab",
                        spark.yarn.principal = "user@xyz",
                        spark.executor.instances = 20,
                        spark.executor.memory = "4G",
                        spark.executor.cores = 4,
                        spark.driver.memory = "8G")))

I get this error:

Error in force(code) :
Failed while connecting to sparklyr to port (8880) for sessionid (9819): Sparklyr gateway did not respond while retrieving ports information after 60 seconds Path: /opt/spark-2.0.0-bin-hadoop2.6/bin/spark-submit Parameters: --class, sparklyr.Backend, --packages, 'ai.h2o:sparkling-water-core_2.11:2.0','ai.h2o:sparkling-water-ml_2.11:2.0','ai.h2o:sparkling-water-repl_2.11:2.0', '/usr/lib64/R/library/sparklyr/java/sparklyr-2.0-2.11.jar', 8880, 9819

---- Output Log ----
Ivy Default Cache set to: /opt/users/user/.ivy2/cache The jars for the packages stored in: /opt/users/user/.ivy2/jars :: loading settings :: url = jar:file:/opt/spark-2.0.0-bin-hadoop2.6/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml ai.h2o#sparkling-water-core_2.11 added as a dependency ai.h2o#sparkling-water-ml_2.11 added as a dependency ai.h2o#sparkling-water-repl_2.11 added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default]

---- Error Log ----
In addition: Warning messages: 1: In if (nchar(config[[e]]) == 0) found <- FALSE : the condition has length 1 and only the first element will be used 2: In if (nchar(config[[e]]) == 0) found <- FALSE : the condition has length 1 and only the first element will be used

I'm new to Spark and clusters and not really sure what to do now. Any help will be much appreciated. My first thought was missing jar files for Sparkling Water on the cluster side; am I right?


1 Answer


You need to use an exact version number of Sparkling Water. In your error log, spark-submit is trying to resolve ai.h2o:sparkling-water-core_2.11:2.0 via Ivy, but '2.0' is not a published release number, so the dependency resolution never completes and the sparklyr gateway times out. Use a full version instead:

options(rsparkling.sparklingwater.version = '2.0.5')
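
For example, a minimal sketch of the corrected setup, assuming Sparkling Water 2.0.5 is compatible with your Spark 2.0.x installation (the option has to be set before spark_connect() is called, since rsparkling reads it while building the --packages list):

library(h2o)
library(rsparkling)
library(sparklyr)

# Pin a published release so Ivy can actually resolve the artifact
options(rsparkling.sparklingwater.version = '2.0.5')

sc <- spark_connect(master = "yarn",
                    config = list(
                      default = list(
                        spark.submit.deployMode = "client",
                        spark.yarn.keytab = "user.keytab",
                        spark.yarn.principal = "user@xyz")))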

Or you can download a binary version of Sparkling Water directly from http://h2o.ai/download, unzip it, and replace the statement above with:

options(rsparkling.sparklingwater.location = "/tmp/sparkling-water-assembly_2.11-2.0.99999-SNAPSHOT-all.jar")
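
Either way, once the connection succeeds you can sanity-check that Sparkling Water is working by starting an H2OContext on the cluster; in the rsparkling API of that era this is done with h2o_context() (shown here as a quick check, assuming default H2O settings):

library(rsparkling)

# Start H2O on the Spark executors and print the context details
h2o_context(sc)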