As I commented, your use of null makes things unnecessarily problematic, so if you can't work without null in the first place, I think it makes most sense to turn it into something more useful:
val df = sparkContext.parallelize(Seq(("a", 1), ("a", null), ("b", null), ("b", 2), ("b", 3), ("c", 2), ("c", 4), ("c", 3)))
  .mapValues { v => Option(v) match {
    case Some(i: Int) => i             // real values stay as Int
    case _            => Int.MaxValue  // null becomes MaxValue
  }
}.groupBy(_._1).map {
  case (k, v) => k -> v.map(_._2).max  // pair each key with the max of its group
}
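Collecting that should give each key paired with either its real max or Int.MaxValue (2147483647) for the groups that held a null; the order of the output isn't guaranteed:

df.collect()
// e.g. Array((a,2147483647), (b,2147483647), (c,4))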
First, I use Option to get rid of null and to move things down the tree from Any to Int so I can enjoy more type safety. I replace null with MaxValue for reasons I'll explain shortly.
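Outside Spark, the narrowing looks like this (a small sketch; toInt is just a name I'm using for illustration):

def toInt(v: Any): Int = Option(v) match {
  case Some(i: Int) => i            // a real value comes back out as an Int
  case _            => Int.MaxValue // null becomes None, so it falls through to here
}

toInt(3)    // 3
toInt(null) // 2147483647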
Then I groupBy as you did, but then I map over the groups to pair the keys with the max of the values, which will either be one of your original data items or MaxValue where the nulls once were. If you must, you can turn them back into null, but I wouldn't.
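If you really do want the nulls back, a sketch of that last step could look like this (java.lang.Integer because Scala's Int can't hold null; again, I wouldn't bother):

df.mapValues {
  case Int.MaxValue => (null: java.lang.Integer) // the groups that once contained a null
  case v            => (v: java.lang.Integer)    // everything else boxes as-is
}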
There might be a simpler way to do all this, but I like the null replacement with MaxValue, the pattern matching which helps me narrow the types, and the fact I can just treat everything the same afterwards.
import org.apache.spark.sql.functions._
df.groupBy("col1").agg(collect_list("col2"))
Now you have a list in col2; try your logic on that, e.g. if the list contains null, return null, else take the max of the list. – Akash Sethi
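One caveat with that suggestion: collect_list drops nulls, so a contains-null check on the collected list will never fire. A minimal sketch of the same null-propagating max written with a conditional aggregate instead (assuming df is the original DataFrame with columns col1 and col2; maxVal and nulls are just illustrative names):

import org.apache.spark.sql.functions._

val result = df.groupBy("col1")
  .agg(
    max("col2").as("maxVal"),                                  // max over the non-null values
    sum(when(col("col2").isNull, 1).otherwise(0)).as("nulls")  // how many nulls the group had
  )
  .select(
    col("col1"),
    when(col("nulls") > 0, lit(null)).otherwise(col("maxVal")).as("col2") // null wins if present
  )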