0 votes

One

First I read in the tweets and parse each one into a tweet case class by mapping my parsing function parseTweet over the lines:

val tweets = sc.textFile("/home/gakuo/Documents/bigdata/NintendoTweets").map(parseTweet)
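The parser itself is not shown; as a minimal sketch, assuming each line is tab-separated into user, text, and like count (the Tweet fields and the file format here are hypothetical):

```scala
// Hypothetical sketch of the case class and parser; the real field layout
// of the tweet files may differ.
case class Tweet(user: String, text: String, likes: Int) {
  // Hashtags extracted from the tweet text.
  def hashtags: Seq[String] = text.split("\\s+").filter(_.startsWith("#")).toSeq
}

def parseTweet(line: String): Tweet = {
  val fields = line.split("\t")
  Tweet(fields(0), fields(1), fields(2).toInt)
}
```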

Two

Then I use a function toPairRdd that maps the tweets into a pair RDD of the form (hashtag, likes):

val pairedRDD = toPairRdd(tweets).persist()
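toPairRdd is not shown either; assuming it flat-maps each tweet into one (hashtag, likes) pair per hashtag, the core logic can be sketched on a plain Seq, since flatMap behaves the same way on an RDD[Tweet] (the Tweet shape here is hypothetical):

```scala
// Hypothetical Tweet shape for illustration.
case class Tweet(text: String, likes: Int) {
  def hashtags: Seq[String] = text.split("\\s+").filter(_.startsWith("#")).toSeq
}

// On an RDD the signature would be toPairRdd(tweets: RDD[Tweet]): RDD[(String, Int)];
// the flatMap body is identical.
def toPairs(tweets: Seq[Tweet]): Seq[(String, Int)] =
  tweets.flatMap(t => t.hashtags.map(tag => (tag, t.likes)))
```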

Question

After reading in my RDD in (one) above, does it help to persist it, given that what follows in (two) is a transformation? I am thinking that since both are lazy, persisting is actually a waste of memory.

Three

After computing the pair RDD, I want to compute the score of each hashtag; toScores uses reduceByKey:

  val scores = toScores(pairedRDD).persist()
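toScores is not shown; assuming the score of a hashtag is the sum of its likes, its reduceByKey semantics can be sketched locally (on the RDD it would simply be pairedRDD.reduceByKey(_ + _)):

```scala
// Local analogue of pairedRDD.reduceByKey(_ + _): sum the likes per hashtag.
def toScores(pairs: Seq[(String, Int)]): Map[String, Int] =
  pairs.groupBy(_._1).map { case (tag, ps) => tag -> ps.map(_._2).sum }
```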

Question

I use reduceByKey. Does this pair RDD method result in shuffling? I have read a paper that states:

"a shuffle can occur when the resulting RDD depends on other elements from the same RDD or another RDD. cogroup, groupWith, join, leftOuterJoin, rightOuterJoin, groupByKey, reduceByKey, combineByKey, distinct, intersection, repartition, coalesce resulting in shuffling. To avoid shuffles for these kinds of operations make sure the transformation follows the same partition as the original RDD"

The same paper also states that reduceByKey follows the same partition as the original RDD.
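In practice this means that if the pair RDD already has a known partitioner, reduceByKey can reuse it and avoid an extra shuffle. A sketch of that pattern (illustrative only, since it needs a live SparkContext):

```scala
import org.apache.spark.HashPartitioner

// If the pair RDD is explicitly hash-partitioned first, the subsequent
// reduceByKey reuses the same partitioner, so only the initial partitionBy
// shuffles; the reduce itself stays partition-local.
val partitioned = pairedRDD.partitionBy(new HashPartitioner(8)).persist()
val scores = partitioned.reduceByKey(_ + _) // no further shuffle
```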

When persisting data, what matters is the number of actions performed on the transformed dataframe. Maybe this can help answer your question: stackoverflow.com/questions/28981359/… – Shaido

3 Answers

1 vote

It matters to use persist (in memory, on disk, or both) when you have many actions that repeatedly trigger the same chain of transformations, and when recomputing them again and again takes too long.

1 vote

In your case no persist or caching is required, as it is a one-pass process. You need to know that stages are generated by pipelining as many transformations as possible together before a shuffle. You would have 2 stages here.

If you were to serve some other data requirements from the pairedRDD as well, then persisting it would be advisable.

The actions are more relevant in any event.

0 votes

If you have multiple actions using the same RDD, then it's advisable to persist. I don't see any action in your code so far, so I don't see any reason to cache the RDD. Persist/cache is also lazily evaluated.

Persist/cache - it is not guaranteed that the data will stay around for the lifetime of the execution: persisted blocks are evicted with an LRU (least recently used) policy, which may flush the least-used RDDs when memory is full. All of this needs to be kept in mind while using persist.
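If eviction under memory pressure is a concern, persist can be given a storage level that spills to disk instead of dropping blocks outright (illustrative Spark fragment):

```scala
import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), whose blocks
// can be evicted under LRU; MEMORY_AND_DISK spills evicted blocks to disk,
// so they are re-read rather than recomputed from the lineage.
pairedRDD.persist(StorageLevel.MEMORY_AND_DISK)
```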

reduceByKey - it's a wide transformation, as a shuffle may happen. But it first combines the data per key within each partition, and only then performs the reduce across partitions, so it's less costly. Always avoid groupByKey, which shuffles the data directly without combining it per key within a partition first. Please avoid groupByKey while coding.
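The difference can be sketched locally: reduceByKey first combines within each partition (the map-side combine) and only then merges the small per-partition results, whereas groupByKey ships every raw pair across the network. A pure-Scala simulation of the two phases, treating partitions as plain Seqs:

```scala
// Phase 1: combine per key inside each partition (map-side combine).
def combineLocally(partition: Seq[(String, Int)]): Map[String, Int] =
  partition.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

// Phase 2: merge the already-reduced partial results across partitions.
// Only these small maps would cross the network in a real shuffle.
def mergePartials(partials: Seq[Map[String, Int]]): Map[String, Int] =
  partials.flatten.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
```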