I have a Spark dataframe (in Scala) with four string columns, P, Q, R and S, and an integer primary key called PK.
Each of these four columns may contain null values. The left-to-right ordering of the columns reflects their importance/relevance and needs to be preserved. The existing columns of the base dataframe stay unchanged.
I want the final output to be as follows:
root
|-- PK: integer (nullable = true)
|-- P: string (nullable = true)
|-- Q: string (nullable = true)
|-- R: string (nullable = true)
|-- S: string (nullable = true)
|-- categoryList: array (nullable = true)
| |-- myStruct: struct (nullable = true)
| | |-- category: string (nullable = true)
| | |-- relevance: integer (nullable = true)
I need to create a new column derived from the 4 columns P, Q, R, S based on the following algorithm:
- For every element in each of the four columns, check whether the element exists as a key in the map "mapM".
- If the element exists, the "category" in the struct will be the corresponding value from mapM. If the element does not exist in mapM, the category shall be null.
- The "relevance" in the struct shall be the order of the column from left to right: P -> 1, Q -> 2, R -> 3, S -> 4.
- The array formed by these four structs is then added to a new column on the dataframe provided.
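Outside of Spark, the steps above can be sketched in plain Scala as follows. This is only an illustration: the names CategoryRelevance, AlgorithmSketch and categorize are hypothetical, the map is assumed to be a Map[String, String], and a null cell is modelled as None.

```scala
// Hypothetical result type: category is None when the value is null
// or is not a key of the map; relevance is the 1-based column position.
final case class CategoryRelevance(category: Option[String], relevance: Int)

object AlgorithmSketch {
  // values holds the cells of one row in column order P, Q, R, S.
  def categorize(values: Seq[Option[String]],
                 mapM: Map[String, String]): Seq[CategoryRelevance] =
    values.zipWithIndex.map { case (value, idx) =>
      // flatMap covers both "cell is null" and "key not in map".
      CategoryRelevance(value.flatMap(mapM.get), idx + 1)
    }
}
```

The position comes from zipWithIndex, so the relevance never has to be supplied by hand for each column.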
I'm new to Scala and here is what I have until now:
case class relevanceCaseClass(category: String, relevance: Integer)
def myUdf = udf((code: String, relevance: Integer) => relevanceCaseClass(mapM.value.getOrElse(code, null), relevance))
df.withColumn("newColumn", myUdf(col("P/Q/R/S"), 1))
The problem is that I cannot pass the ordering value (the 1 above) to myUdf inside withColumn. I need myUdf to know the relevance of the column it is processing. Am I doing something fundamentally wrong?
For example, for the row below I should get this output:
PK P Q R S categoryList
1 a b c null array(struct("a", 1), struct(null, 2), struct("c", 3), struct(null, 4))
Here, the value "b" was not found in the map and hence the value (for category) is null. Since the value for column S was already null, it stayed null. The relevance is according to the left-right column ordering.
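One possible way to get this shape is to look up only the category in the UDF and attach the relevance as a literal column with lit, then combine the four structs with array. This is a minimal sketch, not a definitive solution: it assumes mapM is a Broadcast[Map[String, String]] (consistent with the mapM.value call above), and the names CategoryListSketch and withCategoryList are hypothetical.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{array, col, lit, struct, udf}

object CategoryListSketch {
  // Column order defines relevance: P -> 1, Q -> 2, R -> 3, S -> 4.
  val orderedCols: Seq[String] = Seq("P", "Q", "R", "S")

  def withCategoryList(df: DataFrame,
                       mapM: Broadcast[Map[String, String]]): DataFrame = {
    // Returns None (stored as null) for a null cell or a missing key.
    val lookup = udf((code: String) => Option(code).flatMap(mapM.value.get))

    // One struct per column; the relevance is a literal, not a UDF argument.
    val structs = orderedCols.zipWithIndex.map { case (name, idx) =>
      struct(lookup(col(name)).as("category"), lit(idx + 1).as("relevance"))
    }

    df.withColumn("categoryList", array(structs: _*))
  }
}
```

It would be called as CategoryListSketch.withCategoryList(df, mapM). If the relevance really has to go through the UDF, wrapping the number as lit(1) in the original call should also work, since UDF arguments must be columns rather than plain Scala values.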