0
votes

How to skip the first line of a CSV file and use the second line as the header in a PySpark DataFrame:

prod,daily,impress
id,name,country
01,manish,USA
02,jhon,UK
03,willson,Africa

How do I skip the first line (prod,daily,impress) and use the second line (id,name,country) as the header when reading this file into a Spark DataFrame?

1
I don't think spark.read.csv() can do this directly. How hard is it to remove the first row programmatically? For example, read the file and skip the first line. Or is the file really large, say more than 10 GB? - Christopher
It's between 5 and 7 GB. - harish
I think @Jim Todd's answer should work. As I said, I think you have to convert the DataFrame to an RDD to achieve this, because there is probably no way to do it within the read.csv() function itself. - Christopher

1 Answer

0
votes

I could not think of a way to use the second line as the header other than hard-coding the column names. However, skipping the first two (or any number of) lines of the CSV DataFrame can be achieved by indexing the rows and filtering on the index.

>>> # zipWithIndex() pairs each row with its 0-based position; x[1] > 1 drops rows 0 and 1 (the first two lines)
>>> df = spark.read.csv("sample_csv", sep=',').rdd.zipWithIndex().filter(lambda x: x[1] > 1).map(lambda x: x[0]).toDF(['id', 'name', 'country'])
>>> df.show()
+---+-------+-------+
| id|   name|country|
+---+-------+-------+
| 01| manish|    USA|
| 02|   jhon|     UK|
| 03|willson| Africa|
+---+-------+-------+
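
If hard-coding the column names is not acceptable, a possible variation is to do the skipping on the raw text lines and then let the CSV reader pick up the header itself. This is only a sketch under some assumptions: read_csv_skipping_lines and main are hypothetical names, "sample_csv" is the path from the snippet above, the line order of the file is assumed to be preserved by zipWithIndex(), and no field is assumed to contain an embedded newline (splitting on lines would break such a record). It relies on the fact that PySpark's spark.read.csv() also accepts an RDD of strings.

```python
def read_csv_skipping_lines(spark, path, skip=1, sep=","):
    """Drop the first `skip` lines of a text file, then parse the rest as CSV.

    The first remaining line is used as the header. `spark` is an existing
    SparkSession. The function name and signature are illustrative only.
    """
    lines = (
        spark.sparkContext.textFile(path)      # RDD of raw text lines
        .zipWithIndex()                        # (line, 0-based position)
        .filter(lambda pair: pair[1] >= skip)  # keep lines from position `skip` on
        .map(lambda pair: pair[0])             # back to plain lines
    )
    # spark.read.csv() accepts an RDD of strings as well as a path.
    # Without inferSchema, every column is read as a string.
    return spark.read.csv(lines, header=True, sep=sep)


def main():
    # Imported here so the helper above can be loaded without PySpark present.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("skip-first-line").getOrCreate()
    df = read_csv_skipping_lines(spark, "sample_csv", skip=1)
    df.show()
    spark.stop()
```

Calling main() in an environment with PySpark installed should give a DataFrame whose columns are taken from the id,name,country line. Since the filter runs per line across partitions, the approach should also be workable for a 5 to 7 GB file, though I have not measured it.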