4
votes

I have installed Spark-1.4.0. I have also installed its R package SparkR and I am able to use it via the Spark shell and via RStudio. However, there is one difference I cannot resolve.

When launching the SparkR-shell

./bin/sparkR --master local[7] --packages com.databricks:spark-csv_2.10:1.0.3

I can read a .csv-file as follows

flights <- read.df(sqlContext, "data/nycflights13.csv", "com.databricks.spark.csv", header="true")

Unfortunately, when I start SparkR via RStudio (correctly setting my SPARK_HOME) I get the following error message:

15/06/16 16:18:58 ERROR RBackendHandler: load on 1 failed
Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv

I know I should load com.databricks:spark-csv_2.10:1.0.3 in some way, but I have no idea how to do this. Could someone help me?

4
I followed your steps above, but I'm unable to read the csv file in the sparkR shell. I'm getting this error: Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException. Do you have any idea on this? – Edwin Vivek N
I have no idea; I cannot replicate the error. I do know, however, that your sqlContext exists, that the input path exists, and that it correctly finds com.databricks.spark.csv, otherwise you would get other error messages. Could you state your entire workflow? – Wannes Rosiers
I have added the details here: stackoverflow.com/questions/31050823/… – Edwin Vivek N

4 Answers

3
votes

This is the right syntax (after hours of trying). Note: the key is the first line, and the double quotes matter:

Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')

library(SparkR)
library(magrittr)

# Initialize SparkContext and SQLContext
sc <- sparkR.init(appName="SparkR-Flights-example")
sqlContext <- sparkRSQL.init(sc)


# The SparkSQL context should already be created for you as sqlContext
sqlContext
# Java ref type org.apache.spark.sql.SQLContext id 1

# Load the flights CSV file using `read.df`. Note that we use the CSV reader Spark package here.
flights <- read.df(sqlContext, "nycflights13.csv", "com.databricks.spark.csv", header="true")
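
As a quick sanity check (assuming the load above succeeded), you can inspect the resulting DataFrame:

head(flights)         # first rows of the DataFrame
printSchema(flights)  # columns as inferred by spark-csv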
2
votes

My colleagues and I found the solution. We initialized the sparkContext like this:

sc <- sparkR.init(appName="SparkR-Example",sparkEnvir=list(spark.executor.memory="1g"),sparkJars="spark-csv-assembly-1.1.0.jar")

We did not find a way to load a remote jar, so we downloaded spark-csv_2.11-1.0.3.jar. Including this one in sparkJars does not work on its own, however, since its dependencies are not found locally. You can also pass a list of jars (see the sketch after the code below), but we built an assembly jar containing all dependencies. When loading this jar, it is possible to load the .csv file as desired:

flights <- read.df(sqlContext, "data/nycflights13.csv","com.databricks.spark.csv",header="true")
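
For reference, passing several jars instead of a single assembly would look like this (a sketch; the dependency jar names are assumptions, and every jar must be available locally):

sc <- sparkR.init(appName="SparkR-Example",
                  sparkEnvir=list(spark.executor.memory="1g"),
                  sparkJars=c("spark-csv_2.11-1.0.3.jar", "commons-csv-1.1.jar"))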
0
votes

I downloaded Spark-1.4.0 and, via the command line, went to the directory Spark-1.4.0/R, where I built the SparkR package located in the subdirectory pkg as follows:

R CMD build --resave-data pkg

This produces a .tar.gz file which you can install in RStudio (with devtools, you should be able to install the package in pkg directly as well).
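
For example, a minimal sketch of both install routes (the tarball name is an assumption; it depends on the version you built):

# Install the tarball produced by R CMD build (filename assumed)
install.packages("SparkR_1.4.0.tar.gz", repos = NULL, type = "source")

# Or, with devtools, install straight from the pkg directory
devtools::install("pkg")

Once the package is installed, set your path to Spark in RStudio as follows: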

Sys.setenv(SPARK_HOME="path_to_spark/spark-1.4.0")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
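
To verify the setup, a minimal sketch (master and app name are arbitrary choices):

sc <- sparkR.init(master = "local[2]", appName = "SparkR-test")
sqlContext <- sparkRSQL.init(sc)
sparkR.stop()  # stop the context when you are done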

And you should be ready to go. I can only speak from Mac experience; I hope it helps.

0
votes

If you have tried Pragith's solution above and are still having the issue, it is very possible that the csv file you want to load is not in the current RStudio working directory. Use getwd() to check the RStudio working directory and make sure the csv file is there.
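
For example (the directory path is an assumption; adjust it to where your file lives):

getwd()                          # current RStudio working directory
file.exists("nycflights13.csv")  # is the file actually here?
setwd("path_to_data")            # if not, point R at the right directory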