
I am reading a CSV file that has a header, but I am supplying my own custom schema for the read. I wanted to understand whether `explain` shows any difference when I provide a schema versus letting Spark infer one. My curiosity was raised by this statement about `read.csv` in the docs:

Loads a CSV file and returns the result as a DataFrame. This function will go through the input once to determine the input schema if inferSchema is enabled. To avoid going through the entire data once, disable inferSchema option or specify the schema explicitly using schema.

I can see the time difference at my prompt when I provide the schema compared to using `inferSchema`, but I don't see any difference in the `explain` output. Below are my code and the output with the schema provided:

>> friends_header_df = spark.read.csv(path='resources/fakefriends-header.csv',schema=custom_schems, header='true', sep=',')
>> print(friends_header_df._jdf.queryExecution().toString())
== Parsed Logical Plan ==
Relation[id#8,name#9,age#10,numFriends#11] csv

== Analyzed Logical Plan ==
id: int, name: string, age: int, numFriends: int
Relation[id#8,name#9,age#10,numFriends#11] csv

== Optimized Logical Plan ==
Relation[id#8,name#9,age#10,numFriends#11] csv

== Physical Plan ==
FileScan csv [id#8,name#9,age#10,numFriends#11] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/Users/sgudisa/Desktop/python data analysis workbook/spark-workbook/resour..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:string,age:int,numFriends:int>
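The `custom_schems` variable is not defined in the snippet above. As a hypothetical sketch of how it might look (the field names here are taken from the `ReadSchema` in the physical plan, but the actual definition is an assumption), PySpark accepts a schema either as a DDL-formatted string or as a `StructType`:

```python
# Hypothetical definition of the custom_schems variable used above
# (the actual definition is not shown in the question).
# PySpark accepts a schema as a DDL-formatted string; this form needs
# no imports:
custom_schems = "id INT, name STRING, age INT, numFriends INT"

# Equivalent StructType form (commented out so this snippet stays
# dependency-free):
# from pyspark.sql.types import (StructType, StructField,
#                                IntegerType, StringType)
# custom_schems = StructType([
#     StructField("id", IntegerType()),
#     StructField("name", StringType()),
#     StructField("age", IntegerType()),
#     StructField("numFriends", IntegerType()),
# ])
```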

And below is the output when reading with the `inferSchema` option:

>> friends_noschema_df = spark.read.csv(path='resources/fakefriends-header.csv',header='true',inferSchema='true',sep=',')
>> print(friends_noschema_df._jdf.queryExecution().toString())
== Parsed Logical Plan ==
Relation[userID#32,name#33,age#34,friends#35] csv

== Analyzed Logical Plan ==
userID: int, name: string, age: int, friends: int
Relation[userID#32,name#33,age#34,friends#35] csv

== Optimized Logical Plan ==
Relation[userID#32,name#33,age#34,friends#35] csv

== Physical Plan ==
FileScan csv [userID#32,name#33,age#34,friends#35] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/Users/sgudisa/Desktop/python data analysis workbook/spark-workbook/resour..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<userID:int,name:string,age:int,friends:int>

Except for the expression IDs changing on the columns in the logical plans, I don't see any indication in `explain` that Spark read through all of the data once.


1 Answer


`inferSchema=false` is the default option. With it, all columns come back as strings in the DataFrame. If you provide a schema instead, you get the types you declared.

Inferring a schema means Spark kicks off an extra job under the hood to do exactly that, and you can in fact see that job in the Spark UI. The read takes longer, but, as you state, nothing shows up in the explained plan: the inference pass runs before the query plan is built, so the plan looks identical either way.
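As a toy illustration (plain Python, not Spark internals) of why inference necessarily means an extra full pass over the data before any plan exists:

```python
import csv
import io

# Toy illustration: to infer a schema you must scan every row *before*
# building any query plan, which is why the pass never shows up in
# explain(). The sample data here is made up.
sample = "id,name,age,numFriends\n1,Alice,30,5\n2,Bob,41,2\n"

def infer_schema(csv_text):
    """Scan all rows and guess 'int' or 'string' for each column."""
    types = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for col, val in row.items():
            try:
                int(val)
                guess = "int"
            except ValueError:
                guess = "string"
            # Once a column fails to parse as int, it stays a string.
            if types.get(col, "int") == "string" or guess == "string":
                types[col] = "string"
            else:
                types[col] = "int"
    return types

print(infer_schema(sample))
# {'id': 'int', 'name': 'string', 'age': 'int', 'numFriends': 'int'}
```

Skipping that scan (by passing an explicit schema) is exactly the time saving the docs describe.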