Can parquet, avro and other hadoop file formats have different layout for first line?

Question

Why do I have to convert an RDD to DF in order to write it as parquet, avro or other types? I know writing RDD as these formats is not supported. I was actually trying to write a parquet file with first line containing only the header date and other lines containing the detail records. A sample file layout:

2019-04-06
101,peter,20000
102,robin,25000

I want to create a Parquet file with the above contents. I already have a CSV file, sample.csv, with these contents. When the CSV file is read as a DataFrame, it contains only the first field of each row, because the first row has only one column.

rdd = sc.textFile('hdfs://somepath/sample.csv')
df = rdd.toDF()
df.show()

Output:

2019-04-06
101
102

Could someone please help me convert the entire contents of the RDD into a DataFrame? The same thing happens even when I read the file directly as a DataFrame instead of converting it from an RDD.

OneCricketeer · Accepted Answer · 2019-04-06T20:13:57

Spark's reader sees your file as having only "one column", so the DataFrame output contains only that column.

You didn't necessarily do anything wrong, but your input file is malformed if you expect there to be more than one column. If you do expect that, you should be using spark.read.csv() instead of sc.textFile().
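To see why the file counts as malformed, here is a minimal sketch that parses the sample contents from the question and counts the fields on each line. It uses Python's standard csv module rather than Spark's reader, so it only illustrates the ragged layout, not Spark's internals.

```python
import csv
import io

# Sample contents copied from the question
sample = "2019-04-06\n101,peter,20000\n102,robin,25000\n"

rows = list(csv.reader(io.StringIO(sample)))

# Number of fields on each line: the first line disagrees with the rest
field_counts = [len(r) for r in rows]
print(field_counts)  # [1, 3, 3]

# Keeping only the first field of every line reproduces the output in the question
first_fields = [r[0] for r in rows]
print(first_fields)  # ['2019-04-06', '101', '102']
```

A reader that takes its column count from the first line ends up with one column, which matches the output shown in the question.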

Why do I have to convert an RDD to DF in order to write it as parquet, avro or other types?

Because those formats need a schema, and an RDD has none.

trying to write a parquet file with first line containing only the header date and other lines containing the detail records

CSV file headers need to describe all columns. There cannot be an isolated header above all rows.

Parquet/Avro/ORC/JSON do not have column headers like CSV, but the same applies.
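Since no single line can stand apart from the schema, one possible workaround (not something the answer above prescribes) is to fold the header date into every detail record so that all rows share one layout. The sketch below does this in plain Python. The function name and the column names file_date, id, name and salary are hypothetical, chosen only for illustration.

```python
from datetime import date


def flatten_with_header_date(lines):
    """Attach the date on the first line to every detail record.

    Assumes the first non-empty line is an ISO date and every other
    line has exactly three comma-separated fields (hypothetical names:
    id, name, salary).
    """
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    # Validate the header line as an ISO date (raises ValueError otherwise)
    header_date = date.fromisoformat(cleaned[0])

    records = []
    for ln in cleaned[1:]:
        emp_id, name, salary = ln.split(",")
        records.append({
            "file_date": header_date.isoformat(),
            "id": int(emp_id),
            "name": name,
            "salary": int(salary),
        })
    return records


sample_lines = ["2019-04-06", "101,peter,20000", "102,robin,25000"]
records = flatten_with_header_date(sample_lines)
for rec in records:
    print(rec)
```

Every record now has the same four fields, which is the kind of uniform shape a schema-based format expects. In Spark, the same idea would apply to the rows before they are turned into a DataFrame and written out.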
