I have two CSV files in the same storage directory.

First CSV file:

id name age
1  Hi   20
2  Hello 21

Second CSV file:

id name age country
3  hi1  20   India

When I read them with Spark:

spark.read.format("csv").option("inferschema","true").load("<location>")

I can see all the data, and for ids 1 and 2 the country is NULL, but both header rows show up in the output as data.

Current Output:

_c0 | _c1  | _c2       | _c3 | _c4
id  | name | country   | age | lastname
3   | dfg  | US        | 45  | HI
4   | ghj  | US1       | 33  | Hello
id  | name | country   | age | null
1   | asd  | India     | 21  | null
2   | sdf  | Australia | 20  | null

How can I get a DataFrame in Spark that uses the column names as the header and contains only the corresponding data rows?

Expected Output:

id  | name | country   | age | lastname
3   | dfg  | US        | 45  | HI
4   | ghj  | US1       | 33  | Hello
1   | asd  | India     | 21  | null
2   | sdf  | Australia | 20  | null
It's a bit unclear what you're aiming at. Can you please add the current and the expected outputs? - ernest_k
Updated the question. - Sathya
Add the header option set to true, like .option("header","true") - Nikunj Kakadiya

1 Answer


You just need to tell Spark that the CSV files you are reading have a header row, by setting the header option to true.

You can read the CSV files in the folder as below:

val df = spark.read.format("csv").option("inferschema","true").option("header","true").load("<location>")

The output looks like this:

[image: screenshot of the resulting DataFrame]
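
One caveat, as far as I understand the CSV reader: when a whole folder is loaded in a single call, the column names come from one header and the other files may be matched by position rather than by name. If the files list their columns in a different order, a safer approach might be to read each file on its own and combine the results by column name. Below is a minimal sketch of that idea, assuming Spark 3.1 or later (for the allowMissingColumns flag of unionByName). The object name CsvMergeSketch, the helper mergeByName, and the temporary sample files are made up for illustration; the sample data mirrors the two files in the question.

```scala
import java.nio.file.{Files, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of Spark itself.
object CsvMergeSketch {

  // Read every CSV file separately (each with its own header) and
  // combine them by column name; columns missing from a file become null.
  // Assumes Spark 3.1+ for unionByName(..., allowMissingColumns = true).
  def mergeByName(spark: SparkSession, paths: Seq[String]): DataFrame =
    paths
      .map { p =>
        spark.read
          .option("header", "true")      // first line of each file is the header
          .option("inferSchema", "true") // let Spark guess the column types
          .csv(p)
      }
      .reduce((left, right) => left.unionByName(right, allowMissingColumns = true))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-merge-sketch")
      .master("local[*]") // local mode, for demonstration only
      .getOrCreate()

    // Made-up sample files mirroring the two files in the question.
    val dir: Path = Files.createTempDirectory("csv_demo")
    val first = dir.resolve("first.csv")
    val second = dir.resolve("second.csv")
    Files.write(first, "id,name,age\n1,Hi,20\n2,Hello,21\n".getBytes("UTF-8"))
    Files.write(second, "id,name,age,country\n3,hi1,20,India\n".getBytes("UTF-8"))

    val merged = mergeByName(spark, Seq(first.toString, second.toString))
    merged.show()

    spark.stop()
  }
}
```

Reading the files one by one costs an extra pass per file for schema inference, so for a large folder where all files share the same column order, the single load call with the header option is probably still the simpler choice.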