One
First I read a tweets and parse into a tweet case class through a map into my parsing function parseTweet
:
val tweets = sc.textFile("/home/gakuo/Documents/bigdata/NintendoTweets").map(parseTweet)
Two
Then I use a function to pair RDD that results into a pair RDD of the form (hashtags, likes)
through a map inside toPairRdd:
val pairedRDD = toPairRdd(tweets).persist()
Question
After reading in my RDD in (one) above, does it help to persist it as what follows in (two)is a transformation? I am thinking, since both as lazy, then persisting is actually a waste of memory.
Three
After computing the pairRDD, I want to compute scores of each hashtag:toScores
uses reduceByKey
val scores = toScores(pairedRDD).persist()
Question
I use reduceByKey
. Does this pairRDD method result in shuffling? I have read a paper that states:
"a shuffle can occur when the resulting RDD depends on other elements from the same RDD or another RDD. cogroup, groupWith, join, leftOuterJoin, rightOuterJoin, groupByKey, reduceByKey, combineByKey, distinct, intersection, repartition, coalesce resulting in shuffling. To avoid shuffles for these kinds of operations make sure the transformation follows the same partition as the original RDD"
The same paper also states that reduceByKey
follows the same partition as the original RDD.