As I commented, your use of null makes things unnecessarily problematic, so if you can't work without null in the first place, I think it makes most sense to turn it into something more useful:
val df = sparkContext.parallelize(Seq(("a", 1), ("a", null), ("b", null), ("b", 2), ("b", 3), ("c", 2), ("c", 4), ("c", 3)))
  .mapValues { v => Option(v) match {
    case Some(i: Int) => i             // real values stay as Int
    case _            => Int.MaxValue  // null becomes MaxValue
  }
}.groupBy(_._1).map {
  case (k, v) => k -> v.map(_._2).max  // pair each key with the max of its group
}
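Collecting that should give each key paired with either its real max or Int.MaxValue (2147483647) for the groups that held a null; the order of the output isn't guaranteed:

df.collect()
// e.g. Array((a,2147483647), (b,2147483647), (c,4))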
First, I use Option to get rid of null and to move things down the tree from Any to Int so I can enjoy more type safety. I replace null with MaxValue for reasons I'll explain shortly.
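Outside Spark, the narrowing looks like this (a small sketch; toInt is just a name I'm using for illustration):

def toInt(v: Any): Int = Option(v) match {
  case Some(i: Int) => i            // a real value comes back out as an Int
  case _            => Int.MaxValue // null becomes None, so it falls through to here
}

toInt(3)    // 3
toInt(null) // 2147483647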
Then I groupBy as you did, but then I map over the groups to pair the keys with the max of the values, which will either be one of your original data items or MaxValue where the nulls once were. If you must, you can turn them back into null, but I wouldn't.
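If you really do want the nulls back, a sketch of that last step could look like this (java.lang.Integer because Scala's Int can't hold null; again, I wouldn't bother):

df.mapValues {
  case Int.MaxValue => (null: java.lang.Integer) // the groups that once contained a null
  case v            => (v: java.lang.Integer)    // everything else boxes as-is
}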
There might be a simpler way to do all this, but I like the null replacement with MaxValue, the pattern matching which helps me narrow the types, and the fact I can just treat everything the same afterwards.
import org.apache.spark.sql.functions._
df.groupBy("col1").agg(collect_list("col2"))
Now you have a list in col2; try your logic on that, e.g. if the list contains null, return null, else take the max of the list. – Akash Sethi
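One caveat with that suggestion: collect_list drops nulls, so a contains-null check on the collected list will never fire. A minimal sketch of the same null-propagating max written with a conditional aggregate instead (assuming df is the original DataFrame with columns col1 and col2; maxVal and nulls are just illustrative names):

import org.apache.spark.sql.functions._

val result = df.groupBy("col1")
  .agg(
    max("col2").as("maxVal"),                                  // max over the non-null values
    sum(when(col("col2").isNull, 1).otherwise(0)).as("nulls")  // how many nulls the group had
  )
  .select(
    col("col1"),
    when(col("nulls") > 0, lit(null)).otherwise(col("maxVal")).as("col2") // null wins if present
  )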