How to skip unwanted header lines in a CSV file using a Spark DataFrame (Python/PySpark)

Question

How can I skip the first line of a CSV file and treat the second line as the header in a PySpark DataFrame? Sample file:

prod,daily,impress
id,name,country
01,manish,USA
02,jhon,UK
03,willson,Africa

How do I skip the first line (prod, daily, impress) and use the second line (id, name, country) as the header with a Spark DataFrame?

I don't think spark.read.csv() can achieve this directly. How hard is it to remove the first row programmatically? For example, read the file and skip the first line. Or is the file really large, say more than 10 GB? — Christopher
Possible duplicate of Apache Spark Dataframe - Load data from nth line of a CSV file — abiratsis
I think @Jim Todd's answer should work. As I said, I think you have to convert the DataFrame to an RDD to achieve this, because there is probably no way to do it within the read.csv() function itself. — Christopher

Jim Todd · Accepted Answer · 2019-04-08T08:15:49

I could not think of a way to use the second line as the header other than hard-coding the column names. However, skipping the first two (or any number of) lines of the CSV can be done by going through the RDD:

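>>> # zipWithIndex pairs each row with its 0-based position;
>>> # keeping only x[1] > 1 drops rows 0 and 1 (the two unwanted lines)
>>> df = (spark.read.csv("sample_csv", sep=',')
...       .rdd.zipWithIndex()
...       .filter(lambda x: x[1] > 1)
...       .map(lambda x: x[0])
...       .toDF(['id', 'name', 'country']))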
>>> df.show()
+---+-------+-------+
| id|   name|country|
+---+-------+-------+
| 01| manish|    USA|
| 02|   jhon|     UK|
| 03|willson| Africa|
+---+-------+-------+
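
If hard-coding the column names is not acceptable, one possible variation is to drop the unwanted line at the text level and let the CSV reader pick up the next line as the header. The following is only a sketch, not something from the original answer: read_csv_skipping_lines is a hypothetical helper name, and it assumes a PySpark version in which spark.read.csv accepts an RDD of strings (I believe this is the case from Spark 2.2 onward). It also assumes no quoted field spans multiple lines, since the file is split line by line.

```python
def read_csv_skipping_lines(spark, path, skip=1, **csv_options):
    """Hypothetical helper: drop the first `skip` lines of a CSV file and
    let Spark use the next remaining line as the header.

    `spark` is an existing SparkSession. Extra keyword arguments (e.g. sep)
    are passed through to spark.read.csv; do not pass `header` yourself.
    """
    # Read the file as plain text, one element per line.
    lines = spark.sparkContext.textFile(path)

    # Pair each line with its 0-based position and keep positions >= skip.
    kept = (lines.zipWithIndex()
                 .filter(lambda pair: pair[1] >= skip)
                 .map(lambda pair: pair[0]))

    # Assumption: this PySpark version accepts an RDD of CSV strings here.
    return spark.read.csv(kept, header=True, **csv_options)
```

With an active SparkSession named spark, calling read_csv_skipping_lines(spark, "sample_csv") on the file from the question should yield the columns id, name and country without naming them in code. All columns come back as strings unless inferSchema=True is passed as well.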
