0 votes

One

First I read in the tweets and parse each one into a tweet case class by mapping my parsing function parseTweet over the lines:

val tweets = sc.textFile("/home/gakuo/Documents/bigdata/NintendoTweets").map(parseTweet)
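The parser itself is not shown; as a minimal sketch, assuming each line is tab-separated into user, text, and like count (the Tweet fields and the file format here are hypothetical):

```scala
// Hypothetical sketch of the case class and parser; the real field layout
// of the tweet files may differ.
case class Tweet(user: String, text: String, likes: Int) {
  // Hashtags extracted from the tweet text.
  def hashtags: Seq[String] = text.split("\\s+").filter(_.startsWith("#")).toSeq
}

def parseTweet(line: String): Tweet = {
  val fields = line.split("\t")
  Tweet(fields(0), fields(1), fields(2).toInt)
}
```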

Two

Then I use a function toPairRdd that maps the tweets into a pair RDD of the form (hashtag, likes):

val pairedRDD = toPairRdd(tweets).persist()
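toPairRdd is not shown either; assuming it flat-maps each tweet into one (hashtag, likes) pair per hashtag, the core logic can be sketched on a plain Seq, since flatMap behaves the same way on an RDD[Tweet] (the Tweet shape here is hypothetical):

```scala
// Hypothetical Tweet shape for illustration.
case class Tweet(text: String, likes: Int) {
  def hashtags: Seq[String] = text.split("\\s+").filter(_.startsWith("#")).toSeq
}

// On an RDD the signature would be toPairRdd(tweets: RDD[Tweet]): RDD[(String, Int)];
// the flatMap body is identical.
def toPairs(tweets: Seq[Tweet]): Seq[(String, Int)] =
  tweets.flatMap(t => t.hashtags.map(tag => (tag, t.likes)))
```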

Question

After reading in my RDD in (one) above, does it help to persist it, given that what follows in (two) is a transformation? I am thinking that since both are lazy, persisting is actually a waste of memory.

Three

After computing the pair RDD, I want to compute the score of each hashtag; toScores uses reduceByKey:

  val scores = toScores(pairedRDD).persist()
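toScores is not shown; assuming the score of a hashtag is the sum of its likes, its reduceByKey semantics can be sketched locally (on the RDD it would simply be pairedRDD.reduceByKey(_ + _)):

```scala
// Local analogue of pairedRDD.reduceByKey(_ + _): sum the likes per hashtag.
def toScores(pairs: Seq[(String, Int)]): Map[String, Int] =
  pairs.groupBy(_._1).map { case (tag, ps) => tag -> ps.map(_._2).sum }
```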

Question

I use reduceByKey. Does this pair RDD method result in shuffling? I have read a paper that states:

"a shuffle can occur when the resulting RDD depends on other elements from the same RDD or another RDD. cogroup, groupWith, join, leftOuterJoin, rightOuterJoin, groupByKey, reduceByKey, combineByKey, distinct, intersection, repartition, coalesce resulting in shuffling. To avoid shuffles for these kinds of operations make sure the transformation follows the same partition as the original RDD"

The same paper also states that reduceByKey follows the same partition as the original RDD.
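In practice this means that if the pair RDD already has a known partitioner, reduceByKey can reuse it and avoid an extra shuffle. A sketch of that pattern (illustrative only, since it needs a live SparkContext):

```scala
import org.apache.spark.HashPartitioner

// If the pair RDD is explicitly hash-partitioned first, the subsequent
// reduceByKey reuses the same partitioner, so only the initial partitionBy
// shuffles; the reduce itself stays partition-local.
val partitioned = pairedRDD.partitionBy(new HashPartitioner(8)).persist()
val scores = partitioned.reduceByKey(_ + _) // no further shuffle
```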

When persisting data, what matters is the number of actions performed on the transformed dataframe. Maybe this can help answer your question: stackoverflow.com/questions/28981359/… – Shaido

3 Answers

1 vote

It matters to use persist (in memory, on disk, or both) when you have many actions that repeatedly trigger the same chain of transformations, and when recomputing them again and again takes too long.

1 vote

In your case no persist or caching is required, as it is a one-pass process. You need to know that stages are generated by pipelining as many transformations as possible together before a shuffle. You would have 2 stages here.

If you were to serve some other data requirements from the pairedRDD as well, then persisting it would be advisable.

The actions are more relevant in any event.

0 votes

If you have multiple actions using the same RDD, then it's advisable to persist. I don't see any action in your code so far, so I don't see any reason to cache the RDD. Persist/cache is also lazily evaluated.

Persist/cache - it is not guaranteed that the data will stay around for the lifetime of the execution: persisted blocks are evicted with an LRU (least recently used) policy, which may flush the least-used RDDs when memory is full. All of this needs to be kept in mind while using persist.
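If eviction under memory pressure is a concern, persist can be given a storage level that spills to disk instead of dropping blocks outright (illustrative Spark fragment):

```scala
import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), whose blocks
// can be evicted under LRU; MEMORY_AND_DISK spills evicted blocks to disk,
// so they are re-read rather than recomputed from the lineage.
pairedRDD.persist(StorageLevel.MEMORY_AND_DISK)
```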

reduceByKey - it's a wide transformation, as a shuffle may happen. But it first combines the data per key within each partition, and only then performs the reduce across partitions, so it's less costly. Always avoid groupByKey, which shuffles the data directly without combining it per key within a partition first. Please avoid groupByKey while coding.
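The difference can be sketched locally: reduceByKey first combines within each partition (the map-side combine) and only then merges the small per-partition results, whereas groupByKey ships every raw pair across the network. A pure-Scala simulation of the two phases, treating partitions as plain Seqs:

```scala
// Phase 1: combine per key inside each partition (map-side combine).
def combineLocally(partition: Seq[(String, Int)]): Map[String, Int] =
  partition.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

// Phase 2: merge the already-reduced partial results across partitions.
// Only these small maps would cross the network in a real shuffle.
def mergePartials(partials: Seq[Map[String, Int]]): Map[String, Int] =
  partials.flatten.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
```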