3
votes

I'm trying to compute the largest value of the following DataFrame in Spark 1.6.1:

val df = sc.parallelize(Seq(1,2,3)).toDF("id")

A first approach would be to select the maximum value, and it works as expected:

df.select(max($"id")).show
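
For reference, on this three-row DataFrame the output should look roughly as follows (the exact column name may vary slightly between Spark versions):

+-------+
|max(id)|
+-------+
|      3|
+-------+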

The second approach could be to use withColumn as follows:

df.withColumn("max", max($"id")).show

But unfortunately it fails with the following error message:

org.apache.spark.sql.AnalysisException: expression 'id' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;

How can I compute the maximum value in a withColumn function without any Window or groupBy? If not possible, how can I do it in this specific case using a Window?

3
How would you write what you are trying to achieve in plain old SQL? If there is a way, it's close to that. (I personally do not see intuitively how one can express a max without some kind of aggregation first: either a group by, which is what your second attempt expects, or a subquery, which is what your first working sample does.) So I guess the error is only natural if you think in terms of SQL. - GPI

3 Answers

2
votes

The right approach is to compute an aggregate as a separate query and combine it with the actual result. Unlike the window functions suggested in many answers here, it won't require a shuffle to a single partition and is applicable to large datasets.

It could be done with withColumn using a separate action:

import org.apache.spark.sql.functions.{lit, max}

df.withColumn("max", lit(df.agg(max($"id")).as[Int].first))
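
Here df.agg(max($"id")).as[Int].first runs eagerly as a separate job, so lit embeds a plain Int. On the sample data the result should look roughly like this:

+---+---+
| id|max|
+---+---+
|  1|  3|
|  2|  3|
|  3|  3|
+---+---+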

but it is much cleaner to use either an explicit cross join:

import org.apache.spark.sql.functions.broadcast

df.crossJoin(broadcast(df.agg(max($"id") as "max")))

or an implicit one:

spark.conf.set("spark.sql.crossJoin.enabled", true)

df.join(broadcast(df.agg(max($"id") as "max")))
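
Note that crossJoin and the spark session are Spark 2.x APIs. As a hedged sketch for Spark 1.6.1 (the version in the question), a join without a condition should produce the same Cartesian product directly:

import org.apache.spark.sql.functions.{broadcast, max}

// Spark 1.6 sketch: no crossJoin/SparkSession yet; a condition-less join
// yields the Cartesian product of df (3 rows) and the 1-row aggregate
df.join(broadcast(df.agg(max($"id") as "max"))).show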
-1
votes

There are a few categories of functions in Apache Spark.

  • Aggregate functions, e.g. max, used when we want to aggregate multiple rows into one
  • Non-aggregate functions, e.g. abs or isnull, used when we want to transform one column into another
  • Collection functions, e.g. explode, used when one row expands into multiple rows (a short sketch of all three categories follows this list)
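
A minimal sketch illustrating the three categories (abs and the array column xs are used here purely as representative examples, assuming the usual implicits are in scope):

import org.apache.spark.sql.functions.{abs, explode, max}

// aggregate: collapses all rows into a single row
df.select(max($"id")).show
// non-aggregate: transforms each row's value; the row count is unchanged
df.select(abs($"id")).show
// collection: expands one row into several
Seq((1, Seq("a", "b"))).toDF("id", "xs").select($"id", explode($"xs")).show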

Implicit aggregation

These are used when we want to aggregate multiple rows into one.

The following code internally has an aggregation.

df.select(max($"id")).explain

== Physical Plan ==
*HashAggregate(keys=[], functions=[max(id#3)])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_max(id#3)])
      +- *Project [value#1 AS id#3]
         +- Scan ExistingRDD[value#1]

We can also use multiple aggregate functions in a single select.

df.select(max($"id"), min($"id")).explain
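
With min imported alongside max, the show variant of the same query should produce something like:

import org.apache.spark.sql.functions.{max, min}

df.select(max($"id"), min($"id")).show

+-------+-------+
|max(id)|min(id)|
+-------+-------+
|      3|      1|
+-------+-------+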

Aggregate functions cannot be mixed with non-aggregate functions directly.

The following code will report an error.

df.select(max($"id"), $"id")

df.withColumn("max", max($"id"))

This is because max($"id") produces fewer rows than $"id" does.
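
The error message itself suggests one workaround: wrap the non-aggregate column in first() if you do not care which of its values you get. A minimal sketch:

import org.apache.spark.sql.functions.{first, max}

// runs without error, but first($"id") picks an unspecified row's value
df.select(max($"id"), first($"id")).show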

Aggregate with over

In this case the analytic function is applied and its result is presented for all rows in the result set.

We can use

df.select(max($"id").over, $"id").show

Or

df.withColumn("max", max($"id").over).show
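
Both variants should produce the same result, with the maximum repeated on every row:

+---+---+
| id|max|
+---+---+
|  1|  3|
|  2|  3|
|  3|  3|
+---+---+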
-1
votes

This is Spark 2.0 here.

With withColumn and window functions it could be as follows:

df.withColumn("max", max('id) over)

Note the empty over, which assumes an "empty" window (and is equivalent to over()).

If you however need a more complete WindowSpec you can do the following (again, this is 2.0):

import org.apache.spark.sql.expressions._
// the trick that has performance cost (!)
val window = Window.orderBy()
df.withColumn("max", max('id) over window).show

Please note that the code has a serious performance issue as reported by Spark itself:

WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
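
If your real data has a natural grouping column, a hedged workaround is to partition the window by it (the df2 and group below are hypothetical). Note this computes a per-group maximum, not a global one, but it avoids moving all data to a single partition:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

// hypothetical df2 with a "group" column: each group's max is computed
// within its own partition, so no single-partition shuffle is needed
val byGroup = Window.partitionBy($"group")
df2.withColumn("max", max($"id") over byGroup).show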