I am trying to count the elements of the following RDDs using the count() method. The first is as follows:
scala> val data_wo_header=dropheader(data)
data_wo_header: org.apache.spark.rdd.RDD[String]
When I call count() on this, I get:
scala> data_wo_header.count()
res1: Long = 20000263
This operation is relatively quick, taking about 26 seconds.
Now I transform the RDD as follows:
scala> val ratings_split = data_wo_header.map(line => line.split(",")).persist()
ratings_split: org.apache.spark.rdd.RDD[Array[String]]
scala> ratings_split.count()
res2: Long = 20000263
This count takes around 5 minutes. Can someone suggest why the time to compute the count increased so dramatically?
The dropheader function just drops the first line:
def dropheader(data: RDD[String]): RDD[String] = {
  data.mapPartitionsWithIndex((idx, lines) =>
    // In the first partition, skip the header line. Return the iterator
    // produced by drop instead of reusing the original one: calling
    // drop(1) and then returning lines relies on the state of a
    // partially consumed iterator, which is not guaranteed to be valid.
    if (idx == 0) lines.drop(1) else lines
  )
}
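An equivalent way to drop the header that avoids touching a partially consumed iterator is zipWithIndex (a sketch; dropheaderAlt is a name chosen here for illustration):

```scala
// Pair each line with its global index, then keep everything after index 0.
def dropheaderAlt(data: RDD[String]): RDD[String] =
  data.zipWithIndex().filter { case (_, idx) => idx != 0 }.map(_._1)
```

Note that zipWithIndex triggers an extra Spark job to compute partition sizes, so the mapPartitionsWithIndex version is cheaper for this use.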
data is just val data = sc.textFile(file, 2).cache()
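To separate the one-time cost of materializing the cache from the cost of a count that reads from it, the count can be timed twice on the same persisted RDD (a sketch for the same spark-shell session; timed is a small helper defined here, not a Spark API):

```scala
// Small timing helper (not part of Spark): returns the result and elapsed seconds.
def timed[T](body: => T): (T, Double) = {
  val start = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) / 1e9)
}

// First count runs split(",") on every line and fills the cache.
val (n1, secondsFirst)  = timed(ratings_split.count())
// Second count should read the cached Array[String] blocks from memory.
val (n2, secondsSecond) = timed(ratings_split.count())
println(f"first: $secondsFirst%.1f s, second: $secondsSecond%.1f s")
```

If the second count is much faster, most of the 5 minutes went into computing and caching the split arrays rather than into the counting itself.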