PySpark DataFrame: Change cell value based on min/max condition in another column

Question

I have the following pySpark dataframe:

+------------------+------------------+--------------------+--------------+-------+
|              col1|              col2|                col3|             X|      Y|
+------------------+------------------+--------------------+--------------+-------+
|2.1729247374294496| 3.558069532647046|   6.607603368496324|             1|   null|
|0.2654841575294071|1.2633077949463256|0.023578679968183733|             0|   null|
|0.4253301781296708|3.4566490739823483| 0.11711202266039554|             3|   null|
| 2.608497168338446| 3.529397129549324|   0.373034222141551|             2|   null|
+------------------+------------------+--------------------+--------------+-------+

It is a rather simple operation and I can easily do it with pandas. However, I need to do it using only pySpark.

I want to do the following (I`ll write in sort of pseudocode):

In row where col3 == max(col3), change Y from null to 'K'

In the remaining rows, in the row where col1 == max(col1), change Y from null to 'Z'

In the remaining rows, in the row where col1 == min(col1), change Y from null to 'U'

In the remaining row: change Y from null to 'I'.

Therefore, the expected output is:

+------------------+------------------+--------------------+--------------+-------+
|              col1|              col2|                col3|             X|      Y|
+------------------+------------------+--------------------+--------------+-------+
|2.1729247374294496| 3.558069532647046|   6.607603368496324|             1|      K|
|0.2654841575294071|1.2633077949463256|0.023578679968183733|             0|      U|
|0.4253301781296708|3.4566490739823483| 0.11711202266039554|             3|      I|
| 2.608497168338446| 3.529397129549324|   0.373034222141551|             2|      Z|
+------------------+------------------+--------------------+--------------+-------+

Having that done, I need to use this table as lookup for another table:

+--------------------+--------+-----+------------------+--------------+------------+
|                  x1|      x2|   x3|                x4|             X|           d|
+--------------------+--------+-----+------------------+--------------+------------+
|0057f68a-6330-42a...|    2876|   30| 5.989999771118164|             0|    20171219|
|05cc0191-4ee4-412...|  108381|   34|24.979999542236328|             3|    20171219|
|06f353af-e9d3-4d0...|  118798|   34|               0.0|             3|    20171219|
|0c69b607-112b-4f3...|   20993|   34|               0.0|             0|    20171219|
|0d1b52ba-1502-4ff...|   23817|   34|               0.0|             0|    20171219|

I want to use the first table as lookup to create a new column in second table. The values for the new column should be looked up in column Y in first table using X column in second table as key (so we lookup values in column Y in first table corresponding to values in column X, and those values come from column X in second table).

UPD: I need a solution robust to one row satisfying two conditions, for example:

+------------------+------------------+--------------------+--------------+-------+
|              col1|              col2|                col3|             X|      Y|
+------------------+------------------+--------------------+--------------+-------+
| 2.608497168338446| 3.558069532647046|   6.607603368496324|             1|   null|
|0.2654841575294071|1.2633077949463256|0.023578679968183733|             0|   null|
|0.4253301781296708|3.4566490739823483| 0.11711202266039554|             3|   null|
|2.1729247374294496| 3.529397129549324|   0.373034222141551|             2|   null|
+------------------+------------------+--------------------+--------------+-------+

In this case row 0 satisfies both max('col3') and max('col1') conditions.

So what needs to happen is this:

Row 0 becomes 'K'

Row 3 becomes 'Z' (because out of remaining rows (0 already has 'K' row 3 satisfies max('col1') condition

Row 1 becomes 'U'

Row 2 becomes 'I'

I cannot cannot have multiple rows in table 1 with 'I' in them.

Alper t. Turker Alper t. Turker · Accepted Answer · 2018-01-24T14:15:43

Compute aggregates:

from pyspark.sql import functions as F

df = spark.createDataFrame([
    (2.1729247374294496,  3.558069532647046,    6.607603368496324, 1),
    (0.2654841575294071, 1.2633077949463256, 0.023578679968183733, 0),
    (0.4253301781296708, 3.4566490739823483,  0.11711202266039554, 3),
    (2.608497168338446,  3.529397129549324,    0.373034222141551, 2)
], ("col1", "col2", "col3", "x"))

min1, max1, max3 = df.select(F.min("col1"), F.max("col1"), F.max("col3")).first()

Add column with when:

y = (F.when(F.col("col3") == max3, "K")  # In row where col3 == max(col3), change Y from null to 'K'
    .when(F.col("col1") == max1, "Z")    # In the remaining rows, in the row where col1 == max(col1), change Y from null to 'Z'
    .when(F.col("col1") == min1, "U")    # In the remaining rows, in the row where col1 == min(col1), change Y from null to 'U'
    .otherwise("I"))                     # In the remaining row: change Y from null to 'I'

df_with_y = df.withColumn("y", y)


df_with_y.show()
# +------------------+------------------+--------------------+---+---+
# |              col1|              col2|                col3|  x|  y|
# +------------------+------------------+--------------------+---+---+
# |2.1729247374294496| 3.558069532647046|   6.607603368496324|  1|  K|
# |0.2654841575294071|1.2633077949463256|0.023578679968183733|  0|  U|
# |0.4253301781296708|3.4566490739823483| 0.11711202266039554|  3|  I|
# | 2.608497168338446| 3.529397129549324|   0.373034222141551|  2|  Z|
# +------------------+------------------+--------------------+---+---+

The values for the new column should be looked up in column Y in first table using X column in second table as key

df_with_y.select("x", "Y").join(df2, ["x"])

If y already exists, and you to preserve not null values:

df_ = spark.createDataFrame([
    (2.1729247374294496,  3.558069532647046,    6.607603368496324, 1, "G"),
    (0.2654841575294071, 1.2633077949463256, 0.023578679968183733, 0, None),
    (0.4253301781296708, 3.4566490739823483,  0.11711202266039554, 3, None),
    (2.608497168338446,  3.529397129549324,    0.373034222141551, 2, None)
], ("col1", "col2", "col3", "x", "y"))

min1_, max1_, max3_ = df.filter(F.col("y").isNull()).select(F.min("col1"), F.max("col1"), F.max("col3")).first()

y_ = (F.when(F.col("col3") == max3_, "K") 
    .when(F.col("col1") == max1_, "Z")
    .when(F.col("col1") == min1_, "U") 
    .otherwise("I"))

df_.withColumn("y", F.coalesce(F.col("y"), y_)).show()


# +------------------+------------------+--------------------+---+---+
# |              col1|              col2|                col3|  x|  y|
# +------------------+------------------+--------------------+---+---+
# |2.1729247374294496| 3.558069532647046|   6.607603368496324|  1|  G|
# |0.2654841575294071|1.2633077949463256|0.023578679968183733|  0|  U|
# |0.4253301781296708|3.4566490739823483| 0.11711202266039554|  3|  I|
# | 2.608497168338446| 3.529397129549324|   0.373034222141551|  2|  K|
# +------------------+------------------+--------------------+---+---+

If you experience numerical precision issues you can try:

threshold = 0.0000001 # Choose appropriate 

y_t = (F.when(F.abs(F.col("col3") - max3) < threshold, "K")  # In row where col3 == max(col3), change Y from null to 'K'
    .when(F.abs(F.col("col1") - max1) < threshold, "Z")    # In the remaining rows, in the row where col1 == max(col1), change Y from null to 'Z'
    .when(F.abs(F.col("col1") - min1) < threshold, "U")    # In the remaining rows, in the row where col1 == min(col1), change Y from null to 'U'
    .otherwise("I"))                     # In the remaining row: change Y from null to 'I'

df.withColumn("y", y_t).show()
# +------------------+------------------+--------------------+---+---+
# |              col1|              col2|                col3|  x|  y|
# +------------------+------------------+--------------------+---+---+
# |2.1729247374294496| 3.558069532647046|   6.607603368496324|  1|  K|
# |0.2654841575294071|1.2633077949463256|0.023578679968183733|  0|  U|
# |0.4253301781296708|3.4566490739823483| 0.11711202266039554|  3|  I|
# | 2.608497168338446| 3.529397129549324|   0.373034222141551|  2|  Z|
# +------------------+------------------+--------------------+---+---+

PySpark DataFrame: Change cell value based on min/max condition in another column

1 Answers