As I commented, your use of null makes things unnecessarily problematic, so if you can't work without null in the first place, I think it makes most sense to turn it into something more useful:
val df = sparkContext.parallelize(Seq(
    ("a", 1), ("a", null), ("b", null), ("b", 2), ("b", 3),
    ("c", 2), ("c", 4), ("c", 3)))
  .mapValues { v =>
    Option(v) match {
      case Some(i: Int) => i             // a real value survives as-is
      case _            => Int.MaxValue  // null (None) becomes the sentinel
    }
  }
  .groupBy(_._1)
  .map { case (k, v) => k -> v.map(_._2).max }
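For reference, collecting the result should give, modulo ordering, something like:

df.collect()
// Array((a,2147483647), (b,2147483647), (c,4))

Both a and b collapse to Int.MaxValue because each of those groups contained a null; c keeps its true max of 4.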
First, I use Option to get rid of null and to move things down the type hierarchy from Any to Int, so I can enjoy more type safety. I replace null with MaxValue for reasons I'll explain shortly.
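To see why Option does the heavy lifting here: Option(x) is None exactly when x is null, so the match cleanly separates real Ints from former nulls. A quick demonstration:

Option(null)  // None    -- null safely becomes None
Option(42)    // Some(42) -- real values pass through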
Then I groupBy as you did, and map over the groups to pair each key with the max of its values, which will be either one of your original data items or MaxValue where a null once was. If you must, you can turn them back into null (see the sketch below), but I wouldn't.
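If you really do want the nulls back, a minimal sketch (assuming df is the RDD built above) just maps the sentinel back:

val withNulls = df.map { case (k, v) =>
  k -> (if (v == Int.MaxValue) null else v)  // value type widens back to Any
}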
There might be a simpler way to do all this, but I like the null replacement with MaxValue, the pattern matching, which helps me narrow the types, and the fact that I can treat everything uniformly afterwards.
Comments:

Akash Sethi: import org.apache.spark.sql.functions._ and then df.groupBy("col1").agg(collect_list("col2")). Now you have a list in col2; try your logic on it: if the list contains null, return null, else take the max of the list.

Vidya: […] null in your RDD.
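For what it's worth, here is a hedged sketch of the commenter's DataFrame idea. One caveat: collect_list (like max) silently drops nulls, so you can't literally check the collected list for null; comparing the group's row count against the non-null count of col2 achieves the same "a null poisons the group" behavior. The column names col1 and col2 are assumed from the comment, and df here is a DataFrame rather than the RDD above.

import org.apache.spark.sql.functions._

val result = df.groupBy("col1")
  .agg(
    count(lit(1)).as("total"),         // every row in the group
    count(col("col2")).as("nonNull"),  // count(column) skips nulls
    max(col("col2")).as("maxVal"))     // max skips nulls as well
  .select(
    col("col1"),
    // when(...) with no otherwise(...) yields null when the test fails,
    // i.e. whenever the group contained at least one null in col2
    when(col("total") === col("nonNull"), col("maxVal")).as("maxOrNull"))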