PySpark DataFrame update column value based on min/max condition on timestamp value in another column

Question

I have the following dataframe:

col1        col2 
            
    1   2020-02-27 15:00:00 

    1   2020-02-27 15:04:00

I need the output as

col1        col2            col3

    1   2020-02-27 15:00    
    1   2020-02-27 15:04    Y

Based on the maximum timestamp value present in col2, col3 value has to be populated as Y or null.

I have tried the below approach:

df = spark.sql("select col1,col2 from table")

max_ts = df.select(max("col2")).show() 

y=(f.when(f.col('col2') == max_ts, "Y")) 

df1 = df.withColumn('col3',y)

The above approach yields only null output.

Kindly suggest the possible solution or the mistakes?

TIA.

Edit: I need to perform groupBy on col1 and fetch max value in col2

Som Som · Accepted Answer · 2020-08-06T09:05:43

Perhaps this is helpful-

DSL APIs

max(..).over(window)

df2.show(false)
    df2.printSchema()
    /**
      * +----+-------------------+
      * |col1|col2               |
      * +----+-------------------+
      * |1   |2020-02-27 15:00:00|
      * |1   |2020-02-27 15:04:00|
      * +----+-------------------+
      *
      * root
      * |-- col1: integer (nullable = true)
      * |-- col2: timestamp (nullable = true)
      */

    val w = Window.partitionBy("col1")
    df2.withColumn("col3",
      when(max("col2").over(w).cast("long") - col("col2").cast("long")=== 0, "Y")
    )
      .show(false)

    /**
      * +----+-------------------+----+
      * |col1|col2               |col3|
      * +----+-------------------+----+
      * |1   |2020-02-27 15:00:00|null|
      * |1   |2020-02-27 15:04:00|Y   |
      * +----+-------------------+----+
      */

Spark SQL

 df2.createOrReplaceTempView("table")
    spark.sql(
      """
        | select col1, col2,
        |   case when (cast(max(col2) over (partition by col1) as long) - cast(col2 as long) = 0) then 'Y' end as col3
        | from table
      """.stripMargin)
      .show(false)

    /**
      * +----+-------------------+----+
      * |col1|col2               |col3|
      * +----+-------------------+----+
      * |1   |2020-02-27 15:00:00|null|
      * |1   |2020-02-27 15:04:00|Y   |
      * +----+-------------------+----+
      */

PySpark DataFrame update column value based on min/max condition on timestamp value in another column

1 Answers

DSL APIs

Spark SQL