0
votes

I am trying to read a CSV file using PySpark, but it's throwing an error. Can you tell me the correct way to read a CSV file?

Python code:

from pyspark.sql import *
df = spark.read.csv("D:\Users\SPate233\Downloads\iMedical\query1.csv", inferSchema = True, header = True)

I also tried the following:

sqlContext = SQLContext
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "D:\Users\SPate233\Downloads\iMedical\query1.csv")

Error:

Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    df = spark.read.csv("D:\Users\SPate233\Downloads\iMedical\query1.csv", inferSchema = True, header = True)
NameError: name 'spark' is not defined

and

Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "D:\Users\SPate233\Downloads\iMedical\query1.csv")
AttributeError: type object 'SQLContext' has no attribute 'load'

2 Answers

1
votes

First you need to create a SparkSession, like below:

from pyspark.sql import SparkSession


spark = SparkSession.builder.master("yarn").appName("MyApp").getOrCreate()

Since this session runs on YARN, your CSV needs to be on HDFS; then you can use spark.read.csv:

df = spark.read.csv('/tmp/data.csv', header=True)

where /tmp/data.csv is on HDFS.

0
votes

The simplest way to read a CSV in PySpark is to use Databricks' spark-csv module:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file.csv')

You can also read the file as plain text and split each line on your separator:

reader = sc.textFile("file.csv").map(lambda line: line.split(","))