9 votes

I tried to create a standalone PySpark program that reads a CSV file and stores it in a Hive table. I am having trouble configuring the SparkSession, SparkConf and SparkContext objects. Here is my code:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.types import *

conf = SparkConf().setAppName("test_import")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

spark = SparkSession.builder.config(conf=conf)
dfRaw = spark.read.csv("hdfs:/user/..../test.csv",header=False)

dfRaw.createOrReplaceTempView('tempTable')
sqlContext.sql("create table customer.temp as select * from tempTable")

And I get the error:

dfRaw = spark.read.csv("hdfs:/user/../test.csv",header=False)
AttributeError: 'Builder' object has no attribute 'read'

What is the right way to configure the SparkSession object so that I can use the read.csv command? Also, can someone explain the difference between the Session, Context and Conf objects?


1 Answer

9 votes

There is no need to use both SparkContext and SparkSession to initialize Spark. SparkSession is the newer, recommended entry point, and it should be used on its own. The AttributeError in your code comes from the missing getOrCreate() call: SparkSession.builder.config(conf=conf) returns a Builder, not a SparkSession, so it has no read attribute.

To initialize your environment, simply do:

spark = SparkSession\
  .builder\
  .appName("test_import")\
  .getOrCreate()

You can run SQL commands by doing:

spark.sql(...)
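
Put together, your original program could look like the sketch below. It assumes the cluster is configured for Hive, which the enableHiveSupport() builder option requires; the HDFS path is kept as written in your question.

from pyspark.sql import SparkSession

# enableHiveSupport() is needed because the last step creates a Hive table
spark = SparkSession\
  .builder\
  .appName("test_import")\
  .enableHiveSupport()\
  .getOrCreate()

# read.csv works here because getOrCreate() returned a SparkSession, not a Builder
dfRaw = spark.read.csv("hdfs:/user/..../test.csv", header=False)

dfRaw.createOrReplaceTempView('tempTable')
spark.sql("create table customer.temp as select * from tempTable")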

Prior to Spark 2.0.0, three separate entry points were used: SparkContext, SQLContext and HiveContext. Which one you used depended on the operations you wanted to perform and the data types involved.

With the introduction of the Dataset/DataFrame abstractions, the SparkSession object became the main entry point to the Spark environment. The older objects are still accessible: first initialize a SparkSession (say, in a variable named spark) and then use spark.sparkContext/spark.sqlContext.
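
For example (a minimal sketch):

from pyspark.sql import SparkSession

spark = SparkSession\
  .builder\
  .appName("test_import")\
  .getOrCreate()

sc = spark.sparkContext        # the underlying SparkContext
sqlContext = spark.sqlContext  # the legacy SQLContext, kept for older APIs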