0
votes

I need to find the maximum for an RDD using a tuple as the key. The original RDD, TestRDD, looks like this:

TestRDD(3,249345,038.9,1)
TestRDD(3,249345,785.59,2)
TestRDD(3,249345,584.9,3)
TestRDD(3,249345,427.5,4)
TestRDD(3,249345,410.71,5)

I needed to find the maximum of the 2nd column, grouped by the tuple of columns (1,3). I was able to achieve it by doing the following:

val agg_rdd = TestRDD.map(d => ((d.col1,d.col3),(d.col2))).groupByKey()
val max_AggRDD = agg_rdd.map{case ((col1,col3),(col2)) => (col1,col3) -> col2.max}
val ids_maxAggRDD = max_AggRDD.collect.toSet

Now I need to use the output, ids_maxAggRDD, which is defined as a scala.collection.immutable.Set[((String, String), Long)], as a filter on the original TestRDD.

I can't seem to use its value to do this:

val Max_RDD = TestRDD.filter(v => ids_maxAggRDD.value.contains(v.col1,v.col3,v.col2))

  1. Should I convert the Set of maximum ids to something else?
  2. Is there a better way to achieve what I want to accomplish?
1
A Scala Set has no .value member. What do you expect ids_maxAggRDD.value to return? And what logic do you want to use to compute Max_RDD? - Alex Karpov

1 Answer

0
votes

I was able to get it to work by just using contains without the .value. I'm not sure if this is the best approach.
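
For anyone landing here later, this is a rough sketch of what the working version might look like. It rests on some assumptions: the `Record` case class, its field types (guessed from the `Set[((String, String), Long)]` type in the question), the `MaxPerKey` object, and the sample rows are all made up for illustration. It also swaps `groupByKey` for `reduceByKey`, which is not what the question used, so treat that part as a suggestion rather than the original method.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record type; the field types are an assumption inferred from
// the Set[((String, String), Long)] type mentioned in the question.
case class Record(col1: String, col2: Long, col3: String)

object MaxPerKey {
  // Returns the rows whose col2 is the maximum for their (col1, col3) key.
  def maxRows(sc: SparkContext, rows: Seq[Record]): Set[Record] = {
    val testRDD = sc.parallelize(rows)

    // reduceByKey keeps a running maximum per key instead of collecting
    // every value for the key first, as groupByKey would.
    val maxPerKey = testRDD
      .map(d => ((d.col1, d.col3), d.col2))
      .reduceByKey((a, b) => math.max(a, b))

    // A plain Scala Set: there is no .value to call on it.
    val idsMax: Set[((String, String), Long)] = maxPerKey.collect().toSet

    // The argument to contains has the same shape as the set's elements,
    // i.e. ((col1, col3), col2), not a flat triple.
    testRDD
      .filter(v => idsMax.contains(((v.col1, v.col3), v.col2)))
      .collect()
      .toSet
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("max-per-key").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample rows, not the data from the question.
      val rows = Seq(Record("3", 10L, "a"), Record("3", 25L, "a"), Record("4", 7L, "b"))
      maxRows(sc, rows).foreach(println)
    } finally sc.stop()
  }
}
```

On the `.value` confusion: that member exists on Spark broadcast variables (the result of `sc.broadcast(...)`), not on a collected Set. If the set of maxima were large, wrapping it in a broadcast and calling `.value.contains(...)` inside the filter would probably be worth considering; for a small set, capturing it directly in the closure as above should be fine.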