
Given a dataframe with an index column ("Z"):

val tmp = Seq(("D", 0.1, 0.3, 0.4), ("E", 0.3, 0.1, 0.4), ("F", 0.2, 0.2, 0.5)).toDF("Z", "a", "b", "c")

+---+---+---+---+
|  Z|  a|  b|  c|
+---+---+---+---+
|  D|0.1|0.3|0.4|
|  E|0.3|0.1|0.4|
|  F|0.2|0.2|0.5|
+---+---+---+---+

Say I'm interested in the row where Z = "D":

tmp.filter(col("Z") === "D")
+---+---+---+---+
|  Z|  a|  b|  c|
+---+---+---+---+
|  D|0.1|0.3|0.4|
+---+---+---+---+

How do I get the min and max values of that DataFrame row and their corresponding column names, while keeping the index column?

Desired output if I want the top 2 max values:

+---+---+---+
|  Z|  b|  c|
+---+---+---+
|  D|0.3|0.4|
+---+---+---+

Desired output if I want the min:

+---+---+
|  Z|  a|
+---+---+
|  D|0.1|
+---+---+

What I tried:

// first convert that DF to an array
val tmp = df.collect.map(_.toSeq).flatten
// returns:
tmp: Array[Any] = Array(0.1, 0.3, 0.4) <-- don't know why Any is returned


//take top values of array
val n = 1
tmp.zipWithIndex.sortBy(-_._1).take(n).map(_._2)

But I got this error:

   No implicit Ordering defined for Any.
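
(For reference, the error happens because Row.toSeq returns Seq[Any], so there is no Ordering for the elements. A minimal sketch of the cast that makes the array version sortable, reusing the tmp array and n from above:)

// cast the boxed values back to Double so sortBy has an Ordering to work with
val doubles = tmp.map(_.asInstanceOf[Double])
doubles.zipWithIndex.sortBy { case (value, _) => -value }.take(n).map { case (_, index) => index }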

Is there a way to do it straight from the DataFrame instead of via an array?

Could you please provide more details regarding the desired output, and also let me know what the df DataFrame and the tmp DataFrame are? - Nikk
@Nikk updated to reflect the desired output - jxn

2 Answers


You can do something like this:

tmp
  .where($"Z" === "D")  // or .where($"a" === 0.1), which picks the same row
  .take(1)
  .map { row =>
    // Z sits at index 0 and is a String, so read the numeric columns at indices 1 to 3
    Seq(row.getDouble(1), row.getDouble(2), row.getDouble(3))
  }
  .head
  .sortBy(d => -d)
  .take(2)
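
This returns the two largest values but drops the column names that the desired output also needs. A small sketch of one way to keep them, using the question's tmp with the Z index column (not part of the original answer):

val row = tmp.where($"Z" === "D").head
// pair each numeric column name with its value in that row, then sort by value
val top2 = tmp.columns.filter(_ != "Z")
  .map(name => name -> row.getAs[Double](name))
  .sortBy { case (_, value) => -value }
  .take(2)
// top2 should be Array((c,0.4), (b,0.3)) for the sample data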

Or, if you have a large number of fields, you can take the schema and pattern-match the row fields against the schema data types, like this:

import org.apache.spark.sql.types._

// pair every field in the schema with its positional index
val schemaWithIndex = tmp.schema.zipWithIndex

tmp
  .where($"Z" === "D")
  .take(1)
  .map { row =>
    schemaWithIndex.collect {
      // keep only the Double columns; the String index column Z is skipped
      case (field, index) if field.dataType == DoubleType => row.getDouble(index)
    }
  }
  .head
  .sortBy(d => -d)
  .take(2)

Maybe there is an easier way to do this.
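
As for the desired output shape (a single-row DataFrame that still carries Z), once the top column names are known they can simply be selected from the filtered DataFrame. A sketch, assuming the names turned out to be "b" and "c":

tmp.where($"Z" === "D").select("Z", "b", "c").show()

which should print something like

+---+---+---+
|  Z|  b|  c|
+---+---+---+
|  D|0.3|0.4|
+---+---+---+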


Definitely not the fastest way, but it works straight from the DataFrame.

A more generic solution:

// somewhere in codebase
import org.apache.spark.sql.{DataFrame, Encoder}
import org.apache.spark.sql.functions._
import spark.implicits._

def transform[T, R: Encoder](ds: DataFrame, colsToSelect: Seq[String])(func: Map[String, T] => Map[String, R])
                            (implicit encoder: Encoder[Map[String, R]]): DataFrame = {
  ds.map(row => func(row.getValuesMap(colsToSelect)))  // apply func to each row, seen as a column name -> value map
    .toDF()                                            // single "value" column holding the returned map
    .select(explode(col("value")))                     // one (key, value) row per map entry
    .withColumn("idx", lit(1))
    .groupBy(col("idx")).pivot(col("key")).agg(first(col("value")))  // turn the keys back into columns
    .drop("idx")
}

Now it's just about working with a Map, where the map key is a column name and the map value is the field value.

def fuzzyStuff(values: Map[String, Any]): Map[String, String] = {
  val valueForA = values("a").asInstanceOf[Double]
  // Do whatever you want to do
  // ...
  // Use a map as the return type, where the key is a column name and the value is whatever you want
  Map("x" -> s"fuzzyA-$valueForA")
}


def maxN(n: Int)(values: Map[String, Double]): Map[String, Double] = {
  println(values)  // debug: shows the incoming row as a column name -> value map
  // sort by value (descending) and keep the top n entries together with their column names
  values.toSeq.sortBy { case (_, value) => -value }.take(n).toMap
}
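
The question also asks for the min, so a symmetric helper in the same curried style could look like this (minN is a name introduced here, not part of the original answer):

// hypothetical companion to maxN: keep the n smallest values with their column names
def minN(n: Int)(values: Map[String, Double]): Map[String, Double] = {
  values.toSeq.sortBy { case (_, value) => value }.take(n).toMap
}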

Usage:

val tmp = Seq((0.1, 0.3, 0.4), (0.3, 0.1, 0.4), (0.2, 0.2, 0.5)).toDF("a", "b", "c")
val filtered = tmp.filter(col("a") === 0.1)

transform(filtered, colsToSelect = Seq("a", "b", "c"))(maxN(2))
   .show()

+---+---+
|  b|  c|
+---+---+
|0.3|0.4|
+---+---+

transform(filtered, colsToSelect = Seq("a", "b", "c"))(fuzzyStuff)
   .show()

+----------+
|         x|
+----------+
|fuzzyA-0.1|
+----------+
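
Assuming the hypothetical minN helper sketched above, getting the min together with its column name is the same kind of call:

transform(filtered, colsToSelect = Seq("a", "b", "c"))(minN(1))
   .show()

which should print something like

+---+
|  a|
+---+
|0.1|
+---+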

Step by step (without the generic transform helper):

1. Define the max and min functions:

def maxN(values: Map[String, Double], n: Int): Map[String, Double] = {
  // top n values, keeping their column names
  values.toSeq.sortBy { case (_, value) => -value }.take(n).toMap
}

def min(values: Map[String, Double]): Map[String, Double] = {
  // single smallest value, keeping its column name
  Map(values.minBy { case (_, value) => value })
}

2. Create the dataset:

val tmp = Seq((0.1, 0.3, 0.4), (0.3, 0.1, 0.4), (0.2, 0.2, 0.5)).toDF("a", "b", "c")
val filtered = tmp.filter(col("a") === 0.1)
3. Explode and pivot the map type:

val df = filtered.map(row => maxN(row.getValuesMap(Seq("a", "b", "c")), 2)).toDF()

val exploded = df.select(explode($"value"))
+---+-----+
|key|value|
+---+-----+
|  c|  0.4|
|  b|  0.3|
+---+-----+

//Then pivot
exploded.withColumn("idx", lit(1))
      .groupBy($"idx").pivot($"key").agg(first($"value"))
      .drop("idx")
      .show()

+---+---+
|  b|  c|
+---+---+
|0.3|0.4|
+---+---+
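
Note that this answer builds tmp without the Z index column that the question wants to keep. One way to carry it through, as a sketch: assume the question's original DataFrame (the one that still has Z) is bound to a val named withZ and the pivoted result above is bound to a val named pivoted (both names introduced here):

// look up the Z value of the filtered row and attach it to the pivoted single-row result
val z = withZ.filter(col("Z") === "D").select("Z").head.getString(0)
pivoted.withColumn("Z", lit(z)).select("Z", "b", "c").show()

which should give something like

+---+---+---+
|  Z|  b|  c|
+---+---+---+
|  D|0.3|0.4|
+---+---+---+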