4
votes

I have installed Spark-1.4.0. I have also installed its R package SparkR and I am able to use it via the Spark shell and via RStudio. However, there is one difference I cannot resolve.

When launching the SparkR-shell

./bin/sparkR --master local[7] --packages com.databricks:spark-csv_2.10:1.0.3

I can read a .csv-file as follows

flights <- read.df(sqlContext, "data/nycflights13.csv", "com.databricks.spark.csv", header="true")

Unfortunately, when I start SparkR via RStudio (correctly setting my SPARK_HOME) I get the following error message:

15/06/16 16:18:58 ERROR RBackendHandler: load on 1 failed
Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv

I know I should load com.databricks:spark-csv_2.10:1.0.3 in some way, but I have no idea how to do this. Could someone help me?

4
I followed your steps above, but I'm unable to read the csv file in the sparkR shell. I'm getting this error: Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException. Do you have any idea on this? – Edwin Vivek N
I have no idea; I cannot replicate the error. I do know, however, that your sqlContext exists, that the input path exists, and that it correctly finds com.databricks.spark.csv, otherwise you would get other error messages. Could you state your entire workflow? – Wannes Rosiers
I have added the details here: stackoverflow.com/questions/31050823/… – Edwin Vivek N

4 Answers

3
votes

This is the right syntax (after hours of trying). Note: the key is the first line, and the double quotes matter:

Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')

library(SparkR)
library(magrittr)

# Initialize SparkContext and SQLContext
sc <- sparkR.init(appName="SparkR-Flights-example")
sqlContext <- sparkRSQL.init(sc)


# The SparkSQL context should already be created for you as sqlContext
sqlContext
# Java ref type org.apache.spark.sql.SQLContext id 1

# Load the flights CSV file using `read.df`. Note that we use the CSV reader Spark package here.
flights <- read.df(sqlContext, "nycflights13.csv", "com.databricks.spark.csv", header="true")
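
As a quick sanity check (assuming the load above succeeded), you can inspect the resulting DataFrame:

head(flights)         # first rows of the DataFrame
printSchema(flights)  # columns as inferred by spark-csv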
2
votes

My colleagues and I found the solution. We initialized the sparkContext like this:

sc <- sparkR.init(appName="SparkR-Example",sparkEnvir=list(spark.executor.memory="1g"),sparkJars="spark-csv-assembly-1.1.0.jar")

We did not find a way to load a remote jar, so we downloaded spark-csv_2.11-1.0.3.jar. Including this one in sparkJars does not work on its own, however, since its dependencies are not found locally. You can also pass a list of jars (see the sketch after the code below), but we built an assembly jar containing all dependencies. When loading this jar, it is possible to load the .csv file as desired:

flights <- read.df(sqlContext, "data/nycflights13.csv","com.databricks.spark.csv",header="true")
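
For reference, passing several jars instead of a single assembly would look like this (a sketch; the dependency jar names are assumptions, and every jar must be available locally):

sc <- sparkR.init(appName="SparkR-Example",
                  sparkEnvir=list(spark.executor.memory="1g"),
                  sparkJars=c("spark-csv_2.11-1.0.3.jar", "commons-csv-1.1.jar"))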
0
votes

I downloaded Spark-1.4.0 and, via the command line, went to the directory Spark-1.4.0/R, where I built the SparkR package located in the subdirectory pkg as follows:

R CMD build --resave-data pkg

This produces a .tar.gz file which you can install in RStudio (with devtools, you should be able to install the package in pkg directly as well).
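
For example, a minimal sketch of both install routes (the tarball name is an assumption; it depends on the version you built):

# Install the tarball produced by R CMD build (filename assumed)
install.packages("SparkR_1.4.0.tar.gz", repos = NULL, type = "source")

# Or, with devtools, install straight from the pkg directory
devtools::install("pkg")

Once the package is installed, set your path to Spark in RStudio as follows: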

Sys.setenv(SPARK_HOME="path_to_spark/spark-1.4.0")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
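
To verify the setup, a minimal sketch (master and app name are arbitrary choices):

sc <- sparkR.init(master = "local[2]", appName = "SparkR-test")
sqlContext <- sparkRSQL.init(sc)
sparkR.stop()  # stop the context when you are done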

And you should be ready to go. I can only speak from Mac experience; I hope it helps.

0
votes

If you have tried Pragith's solution above and are still having the issue, it is very possible that the csv file you want to load is not in the current RStudio working directory. Use getwd() to check the RStudio working directory and make sure the csv file is there.
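
For example (the directory path is an assumption; adjust it to where your file lives):

getwd()                          # current RStudio working directory
file.exists("nycflights13.csv")  # is the file actually here?
setwd("path_to_data")            # if not, point R at the right directory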