I needed to find the maximum for an RDD using a tuple as the key. The original RDD, TestRDD, looks like this:
TestRDD(3,249345,038.9,1)
TestRDD(3,249345,785.59,2)
TestRDD(3,249345,584.9,3)
TestRDD(3,249345,427.5,4)
TestRDD(3,249345,410.71,5)
I needed to find the maximum of the 2nd column, grouped by the tuple of columns (1, 3). I was able to achieve it by doing the following:
val agg_rdd = TestRDD.map(d => ((d.col1,d.col3),(d.col2))).groupByKey()
val max_AggRDD = agg_rdd.map{case ((col1,col3),(col2)) => (col1,col3) -> col2.max}
val ids_maxAggRDD = max_AggRDD.collect.toSet
Now I need to use the output of ids_maxAggRDD, which is defined as a scala.collection.immutable.Set[((String, String), Long)], as a filter on the original TestRDD.
I can't seem to use the value to do this:
val Max_RDD = TestRDD.filter(v => ids_maxAggRDD.value.contains(v.col1,v.col3,v.col2))
- Should I convert the Set of maximum ids to something?
- Is there a better way to achieve what I want to accomplish?
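
For illustration, here is a minimal sketch of one way the filter could be expressed. It is not a definitive answer. It assumes a hypothetical case class Record whose field types are inferred from the reported Set[((String, String), Long)], uses made-up sample rows rather than the data above, and assumes a Spark version where pair-RDD operations such as reduceByKey are available without extra imports. The helper name rowsAtMax is also hypothetical. Note that .value is a member of Spark's Broadcast wrapper, not of a plain Scala Set or Map.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical record type: field names follow the question,
// types are inferred from Set[((String, String), Long)].
case class Record(col1: String, col2: Long, col3: String)

object MaxPerKey {
  // Keep only the rows whose col2 equals the maximum col2 for their (col1, col3) key.
  def rowsAtMax(rdd: RDD[Record]): RDD[Record] = {
    // reduceByKey combines values per key without materializing every value,
    // unlike groupByKey followed by max.
    val maxByKey = rdd
      .map(r => ((r.col1, r.col3), r.col2))
      .reduceByKey((a, b) => math.max(a, b))

    // Assumes the number of distinct keys is small enough to collect to the driver.
    // The Map is wrapped in a Broadcast, which is what provides .value.
    val maxima = rdd.sparkContext.broadcast(maxByKey.collectAsMap())

    rdd.filter(r => maxima.value.get((r.col1, r.col3)).exists(_ == r.col2))
  }

  def main(args: Array[String]): Unit = {
    // Local context, for illustration only.
    val conf = new SparkConf().setAppName("max-per-key").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Made-up sample rows, not the data from the question.
    val testRDD = sc.parallelize(Seq(
      Record("a", 10L, "x"),
      Record("a", 25L, "x"),
      Record("b", 7L, "y")
    ))

    rowsAtMax(testRDD).collect().foreach(println)
    sc.stop()
  }
}
```

With the sample rows above, the rows kept should be the ones with col2 = 25 for key (a, x) and col2 = 7 for key (b, y); the order of the collected output is not guaranteed. If there are too many keys to collect to the driver, joining the keyed original RDD with maxByKey on the (col1, col3) key is an alternative worth considering.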
There is no .value in Scala Set. What do you want to get as a result of ids_maxAggRDD.value? And what is the logic you want to use to compute Max_RDD? – Alex Karpov