Write PySpark data frame to S3

Question

I have a PySpark data frame that I want to write to S3. My data frame looks like this:

id          age       gender        salary      item
1            32        M            30000        A
2            28        F            27532        B
3            39        M            32000        A
4            22        F            22000        C

When I read that data frame back from S3, it looks like this:

_c0         _c1       _c2           _c3         _c4
id          age       gender        salary      item
1            32        M            30000        A
2            28        F            27532        B
3            39        M            32000        A
4            22        F            22000        C

An extra header row (_c0, _c1, ...) appears, and my original header shows up as the first data row.

This is what I have done:

df.coalesce(1).write.format('csv').mode('overwrite').option("header", "false")\
.save("s3a://xxx-aaa/data/group=XXX/my_data/")

# reading the data
final_df = spark.read.csv("s3a://xxx-aaa/data/group=XXX/my_data/")

ashish14 · Accepted Answer · 2019-05-15T07:06:02

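Use .option("header", "true") when saving, and spark.read.csv(filepath, header=True) when reading. Without header=True, Spark does not treat the first line of the file as column names; it assigns the default names _c0, _c1, ... and reads every line as data.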
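
Putting both changes together, a minimal sketch could look like the following. The helper names write_csv_with_header and read_csv_with_header are hypothetical (they are not part of PySpark), the path is the one from the question, and df and spark are assumed to be your existing DataFrame and SparkSession.

```python
# Hypothetical helper names; the S3 path is copied from the question.
S3_PATH = "s3a://xxx-aaa/data/group=XXX/my_data/"


def write_csv_with_header(df, path=S3_PATH):
    """Write a PySpark DataFrame as CSV with the column names on the first line."""
    (
        df.coalesce(1)                # single output part file, as in the question
        .write.format("csv")
        .mode("overwrite")
        .option("header", "true")     # write the column names as the first line
        .save(path)
    )


def read_csv_with_header(spark, path=S3_PATH):
    """Read the CSV back, using the first line as column names."""
    return spark.read.csv(path, header=True)
```

You would then call write_csv_with_header(df) followed by final_df = read_csv_with_header(spark). Note that, read this way, every column comes back as a string; pass inferSchema=True or an explicit schema to spark.read.csv if you need typed columns.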
