Add derived column (as array of struct) based on values and ordering of other columns in Spark Scala dataframe

Question

I have a Scala Spark dataframe with four columns (all string type) - P, Q, R, S - and a primary key (called PK) (integer type).

Each of these 4 columns may have null values. The left to right ordering of the columns is the importance/relevance of the column and needs to be preserved. The structure of the base dataframe stays the same as shown.

I want the final output to be as follows:

root
 |-- PK: integer (nullable = true)
 |-- P: string (nullable = true)
 |-- Q: string (nullable = true)
 |-- R: string (nullable = true)
 |-- S: string (nullable = true)
 |-- categoryList: array (nullable = true)
 |    |-- myStruct: struct (nullable = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- relevance: integer (nullable = true)

I need to create a new column derived from the 4 columns P, Q, R, S based on the following algorithm:

For every element in each of the four columns, check whether the element exists in the Map "mapM".
If the element exists, the "category" in the struct will be the corresponding value from mapM. If the element does not exist in mapM, the category shall be null.
The "relevance" in the struct shall be the order of the column from left to right: P -> 1, Q -> 2, R -> 3, S -> 4.
The array formed by these four structs is then added to a new column on the dataframe provided.

I'm new to Scala and here is what I have so far:

case class relevanceCaseClass(category: String, relevance: Integer)
def myUdf = udf((code: String, relevance: Integer) => relevanceCaseClass(mapM.value.getOrElse(code, null), relevance))
df.withColumn("newColumn", myUdf(col("P/Q/R/S"), 1))

The problem with this is that I cannot pass the value of the ordering inside the withColumn function. I need to let the myUdf function know the value of the relevance. Am I doing something fundamentally wrong?

Thus I should get the output:

PK   P    Q    R    S    newCol
1    a    b    c    null array(struct("a", 1), struct(null, 2), struct("c", 3), struct(null, 4))

Here, the value "b" was not found in the map and hence the value (for category) is null. Since the value for column S was already null, it stayed null. The relevance is according to the left-right column ordering.

Ramesh Maharjan · Accepted Answer · 2018-08-30T05:20:09

Given an input dataframe (test data as given in the question) as

+---+---+---+---+----+
|PK |P  |Q  |R  |S   |
+---+---+---+---+----+
|1  |a  |b  |c  |null|
+---+---+---+---+----+

root
 |-- PK: integer (nullable = false)
 |-- P: string (nullable = true)
 |-- Q: string (nullable = true)
 |-- R: string (nullable = true)
 |-- S: null (nullable = true)

and a broadcast Map defined as

val mapM = spark.sparkContext.broadcast(Map("a" -> "a", "c" -> "c"))

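You can pass all four columns to the udf as a single array and derive the relevance from each element's position. Use null (not the string "null") as the default, so that values missing from the map give a real null category:

import org.apache.spark.sql.functions.{array, col, udf}
// Look up each value in the broadcast map (null when absent) and pair it with its 1-based column position
def myUdf = udf((pqrs: Seq[String]) => pqrs.zipWithIndex.map { case (code, idx) => relevanceCaseClass(mapM.value.getOrElse(code, null), idx + 1) })
val finaldf = df.withColumn("newColumn", myUdf(array(col("P"), col("Q"), col("R"), col("S"))))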

with the case class from the question

case class relevanceCaseClass(category: String, relevance: Integer)
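
To see what the udf body computes for a single row, the same logic can be run on a plain Scala Seq without Spark. This is only a sketch: `categoryMap` stands in for `mapM.value` (assumed to hold the same entries), and `RowLogicSketch` and `toStructs` are hypothetical names used for illustration.

```scala
// Same shape as the case class used by the udf.
case class relevanceCaseClass(category: String, relevance: Integer)

object RowLogicSketch {
  // Stand-in for mapM.value (assumption: same contents as the broadcast map).
  val categoryMap: Map[String, String] = Map("a" -> "a", "c" -> "c")

  // Hypothetical helper: pair each value with its 1-based position, left to right.
  def toStructs(pqrs: Seq[String]): Seq[relevanceCaseClass] =
    pqrs.zipWithIndex.map { case (code, idx) =>
      // A value missing from the map, or a null value, yields a null category.
      relevanceCaseClass(categoryMap.getOrElse(code, null), idx + 1)
    }

  def main(args: Array[String]): Unit =
    toStructs(Seq("a", "b", "c", null)).foreach(println)
}
```

Running `main` prints one struct per column; "b" and the null value both end up with a null category, while the relevance runs from 1 to 4.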

This should give you the desired output; finaldf would be

+---+---+---+---+----+--------------------------------------+
|PK |P  |Q  |R  |S   |newColumn                             |
+---+---+---+---+----+--------------------------------------+
|1  |a  |b  |c  |null|[[a, 1], [null, 2], [c, 3], [null, 4]]|
+---+---+---+---+----+--------------------------------------+

root
 |-- PK: integer (nullable = false)
 |-- P: string (nullable = true)
 |-- Q: string (nullable = true)
 |-- R: string (nullable = true)
 |-- S: null (nullable = true)
 |-- newColumn: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- relevance: integer (nullable = true)

I hope the answer is helpful
