
I want to read a CSV file into an RDD using Spark 2.0. I can read it into a DataFrame using

df = session.read.csv("myCSV.csv", header=True)
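
(For context, session and context in these snippets are assumed to be a SparkSession and its SparkContext, created roughly like this; the app name is arbitrary:)

from pyspark.sql import SparkSession

# "csv-to-rdd" is just a placeholder app name for this example
session = SparkSession.builder.appName("csv-to-rdd").getOrCreate()
context = session.sparkContext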

and I can load it as a text file and then process it using

import csv
import itertools

rdd = context.textFile("myCSV.csv")
header = rdd.first().replace('"', '').split(',')
# skip the header line (the first line of the first partition),
# then parse the remaining lines with the csv module
rdd = (rdd.mapPartitionsWithIndex(
           lambda idx, itr: itertools.islice(itr, 1, None) if idx == 0 else itr)
          .mapPartitions(lambda x: csv.reader(x))
      )
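
If it's useful, once the rows are parsed you can zip each one with the header captured above to get dicts instead of bare lists (a sketch, assuming every row has the same number of columns as the header):

# pair each parsed row's values with the header column names
dict_rdd = rdd.map(lambda row: dict(zip(header, row)))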

Is there a way to use the built-in CSV reader (spark-csv) to go straight to an RDD, without having to convert from a DataFrame to an RDD? Or is the RDD approach above good enough because the built-in reader does something similar under the hood anyway?

Edit: 1) Again, I don't want to read into a DataFrame and then convert it to an RDD. That builds up an entire structure only to have it immediately dropped, which seems pointless. 2) Yes, I can time the approach above against the DataFrame -> RDD conversion, but that would only tell me whether my RDD read beats the conversion; a built-in CSV-to-RDD method would most likely be better optimized than the code above.


1 Answer


You can convert a DataFrame to an RDD by using .rdd, as below:

rdd = session.read.csv("myCSV.csv", header=True).rdd
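
Note that .rdd on a DataFrame gives you an RDD of Row objects rather than the lists of strings your textFile approach produces. If you want plain lists, you can map over the result, e.g.:

# each Row is iterable, so list(...) turns it into a plain list of column values
rdd = session.read.csv("myCSV.csv", header=True).rdd.map(list)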