
I currently have a lot of checks in my process and I want to reduce them. I want to replace this:

CASE WHEN A > B THEN 1 ELSE 0 END AS COL1
CASE WHEN A = B THEN 1 ELSE 0 END AS COL2
CASE WHEN A < B THEN 1 ELSE 0 END AS COL3

with something like this:

CASE WHEN A > B THEN 1 AS COL1, 0 AS COL2, 0 AS COL3
ELSE CASE WHEN A = B THEN 0 AS COL1, 1 AS COL2, 0 AS COL3
ELSE 0 AS COL1, 0 AS COL2, 1 AS COL3

In my case, it's NECESSARY to do it this way: these three columns already exist, and I need to reduce the processing.

Well, you can't. CASE is an expression that returns a single value. You could return a single value to indicate the relationship. Splitting the comparison into three columns is overkill. – Gordon Linoff
The way the system was made forces me to use these columns; otherwise I would put everything in one. Do you know if there is any way to reduce the comparisons? Maybe in PySpark? – Gustavo F

1 Answer


In PySpark, there are several complex types you can use for this.

Here is an example with an array:

from pyspark.sql import functions as F

# Assuming your dataframe is called df
df.show()
+---+---+
|  A|  B|
+---+---+
| -1|  0|
|  0|  0|
|  1|  0|
+---+---+

df = df.withColumn(
    "test",
    F.when(F.col("A") == F.col("B"), F.array(*map(F.lit, [0, 1, 0])))
    .when(F.col("A") < F.col("B"), F.array(*map(F.lit, [0, 0, 1])))
    .when(F.col("A") > F.col("B"), F.array(*map(F.lit, [1, 0, 0]))),
)

df.show()
+---+---+---------+
|  A|  B|     test|
+---+---+---------+
| -1|  0|[0, 0, 1]|
|  0|  0|[0, 1, 0]|
|  1|  0|[1, 0, 0]|
+---+---+---------+

Once the test column is created, you can split it into the three existing columns with getItem:

df = df.select(
    "A", "B", *(F.col("test").getItem(i).alias(f"col{i+1}") for i in range(3))
)

df.show()
+---+---+----+----+----+
|  A|  B|col1|col2|col3|
+---+---+----+----+----+
| -1|  0|   0|   0|   1|
|  0|  0|   0|   1|   0|
|  1|  0|   1|   0|   0|
+---+---+----+----+----+