0
votes

I am trying to read a CSV file using PySpark, but it's throwing an error. Can you tell me the correct way to read a CSV file?

Python code:

from pyspark.sql import *
df = spark.read.csv("D:\Users\SPate233\Downloads\iMedical\query1.csv", inferSchema = True, header = True)

I also tried the following:

sqlContext = SQLContext
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "D:\Users\SPate233\Downloads\iMedical\query1.csv")

Error:

Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    df = spark.read.csv("D:\Users\SPate233\Downloads\iMedical\query1.csv", inferSchema = True, header = True)
NameError: name 'spark' is not defined

and

Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "D:\Users\SPate233\Downloads\iMedical\query1.csv")
AttributeError: type object 'SQLContext' has no attribute 'load'

2 Answers

1
votes

First you need to create a SparkSession, like below:

from pyspark.sql import SparkSession


spark = SparkSession.builder.master("yarn").appName("MyApp").getOrCreate()

Since this session runs on YARN, your CSV needs to be on HDFS; then you can use spark.read.csv:

df = spark.read.csv('/tmp/data.csv', header=True)

where /tmp/data.csv is on HDFS.

0
votes

The simplest way to read a CSV in PySpark is to use Databricks' spark-csv module:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file.csv')

You can also read the file as plain text and split each line on your separator:

reader = sc.textFile("file.csv").map(lambda line: line.split(","))