
This is really frustrating. I have spent several days going through all the related issues here on Stack Overflow and on the web, following the instructions step by step, but I can't figure it out. I've almost given up… This is my error output:

Spark package found in SPARK_HOME: C:/spark/spark_3_0_1_bin_hadoop3_2
Launching java with spark-submit command C:/spark/spark_3_0_1_bin_hadoop3_2/bin/spark-submit2.cmd --driver-memory "2g" sparkr-shell C:\Users\user\AppData\Local\Temp\RtmpgT8rjY\backend_port11e45fad26cf
Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap: JVM is not ready after 10 seconds

(... by the way, why does it launch “spark-submit2.cmd” and not “spark-submit”?)

After running this code:

> Sys.setenv(SPARK_HOME = "C:/spark/spark_3_0_1_bin_hadoop3_2")
> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
> sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))

What I have done so far:

  1. checked for the latest JRE version (JRE 8u271), the folder permissions and the environment path: all OK
  2. installed rtools40-x86_64 and set its path; RStudio then finds C:\rtools40\usr\bin\make.exe
  3. downloaded the latest pre-built version spark-3.0.1-bin-hadoop3.2.tgz and decompressed it with owner permission into C:\spark (no spaces in folder names!), and for safety I replaced all punctuation in the folder name with underscores, as you can see in my script above. Then set the environment path
  4. checked that all permissions for all users were set on C:\spark\spark_3_0_1_bin_hadoop3_2: OK
  5. manually unzipped sparkr.zip (contained in C:\spark\spark_3_0_1_bin_hadoop3_2\R\lib) into my R library C:\Program Files\R\R-4.0.3\library
  6. downloaded winutils for Hadoop 3.0.0, unpacked it into C:\winutils\bin and set the path (see the sketch after this list)
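
For reference, the same variables can also be set for the current R session only, before SparkR is loaded. This is just a sketch using the paths from my setup (adjust them to yours):

# set the environment variables for the current R session only (paths from my setup)
Sys.setenv(JAVA_HOME = "C:/Java")
Sys.setenv(HADOOP_HOME = "C:/winutils")
Sys.setenv(SPARK_HOME = "C:/spark/spark_3_0_1_bin_hadoop3_2")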

SparkR launches successfully from the Windows command prompt. I’ve also launched spark-submit and everything was OK.

My environment paths (a quick way to check them from within R follows this list):

  • JAVA_HOME: C:\Java
  • R_HOME: C:\Program Files\R\R-4.0.3\bin\x64
  • RTools: C:\rtools40
  • SPARK_HOME: C:\spark\spark_3_0_1_bin_hadoop3_2
  • HADOOP_HOME: C:\winutils
  • Path: C:\Program Files\R\R-4.0.3\bin\x64;C:\rtools40;C:\rtools40\mingw64\bin;C:\Java; [...]
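
A quick way to double-check which values the R session actually sees (just a sketch):

# print the environment variables as seen by the current R session
Sys.getenv(c("JAVA_HOME", "R_HOME", "SPARK_HOME", "HADOOP_HOME"))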

I use sparklyr too and it works very well, connecting in RStudio without any problems! But not SparkR...

What more can I do to initialize SparkR in RStudio and work with its functions?

RStudio Version 1.3.1093
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=Italian_Italy.1252  LC_CTYPE=Italian_Italy.1252   
[3] LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C                  
[5] LC_TIME=Italian_Italy.1252    
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] SparkR_3.0.1
loaded via a namespace (and not attached):
[1] compiler_4.0.3 tools_4.0.3

Thanks, Gabriel


1 Answer


Finally, I've figured it out!

I cannot explain why, but all I added was a 'SPARKR_SUBMIT_ARGS' entry in Sys.setenv for reading CSVs, as I found in some old Spark 1 topics (as here), and the SparkR session initialized.

Here are my new code lines:

SPARK_HOME = "C:/spark/spark_3_0_1_bin_hadoop3_2"
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.4.0" "sparkr-shell"')
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))) 
sparkR.session(master = "local", sparkHome = SPARK_HOME) 

Furthermore, when I had a problem reading CSVs with the read.df function, I changed the Hadoop path to point to C:\winutils without the \bin folder, as follows, and everything now works fine:

HADOOP_HOME: C:\winutils
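
For completeness, a minimal read.df sketch with a hypothetical file path (note that Spark 2.0+ ships a built-in csv source, so the Databricks package above isn't strictly required just for this):

# read a CSV with SparkR's built-in csv source (the path is only an example)
df <- read.df("C:/data/example.csv", source = "csv", header = "true", inferSchema = "true")
head(df)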

I hope this is useful for everyone who encounters the same problems.