Transform columns in Spark DataFrame based on map without using UDFs

Question

I would like to transform some columns in my dataframe based on configuration represented by Scala maps.

I have 2 case:

Receiving a map Map[String, Seq[String]] and columns col1, col2, to transform col3 if there is an entity in a map with key = col1, and col2 is in this entity value list.
Receiving a map Map[String, (Long, Long) and col1, col2, to transform col3 if there is an entity in a map with key = col1 and col2 is in a range describe by the tuple of Longs as (start, end).

examples:

case 1 having this table, and a map Map(u1-> Seq(w1,w11), u2 -> Seq(w2,w22))

+------+------+------+
| col1 | col2 | col3 | 
+------+------+------+
| u1   | w1   | v1   |
+------+------+------+
| u2   | w2   | v2   |
+------+------+------+
| u3   | w3   | v3   |
+------+------+------+

I would like to add "x-" prefix to col3, only if it matchs the term

+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1   | w1   | x-v1 |
+------+------+------+
| u2   | w2   | x-v2 |
+------+------+------+
| u3   | w3   | v3   |
+------+------+------+

case 2: This table and map Map("u1" -> (1,5), u2 -> (2, 4))

+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1   | 2    | v1   |
+------+------+------+
| u1   | 6    | v11  |
+------+------+------+
| u2   | 3    | v3   |
+------+------+------+
| u3   | 4    | v3   |
+------+------+------+

expected output should be:

+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1   | 2    | x-v1 |
+------+------+------+
| u1   | 6    | v11  |
+------+------+------+
| u2   | 3    | x-v3 |
+------+------+------+
| u3   | 4    | v3   |
+------+------+------+

This can easily be done by UDFs, but for performance concerned, I would like not to use them.

Is there a way to achieve it without it in Spark 2.4.2?

Thanks

can you also add sample input & expected output ?? & what spark version ?? — Srinivas
is it ok, if i convert this Map("u1" -> (1,5), u2 -> (2, 4)) to Map("u1" -> Seq(1,5), u2 -> Seq(2, 4)) ? — Srinivas

Srinivas Srinivas · Accepted Answer · 2020-07-20T12:04:47

Check below code.

Note -

I have changed your second case map value to Map("u1" -> Seq(1,5), u2 -> Seq(2, 4))
Converting map values to json map, adding json map as column values to DataFrame, then applying logic on DataFrame.
If possible you can directly add values inside json map so that you can avoid conversion map to json map.

Import required libraries.

import org.apache.spark.sql.types._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

Case-1 Logic

scala> val caseOneDF = Seq(("u1","w1","v1"),("u2","w2","v2"),("u3","w3","v3")).toDF("col1","col2","col3")
caseOneDF: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more field]

scala> val caseOneMap = Map("u1" -> Seq("w1","w11"),"u2" -> Seq("w2","w22"))
caseOneMap: scala.collection.immutable.Map[String,Seq[String]] = Map(u1 -> List(w1, w11), u2 -> List(w2, w22))

scala> val caseOneJsonMap = lit(compact(render(caseOneMap)))
caseOneJsonMap: org.apache.spark.sql.Column = {"u1":["w1","w11"],"u2":["w2","w22"]}

scala> val caseOneSchema = MapType(StringType,ArrayType(StringType))
caseOneSchema: org.apache.spark.sql.types.MapType = MapType(StringType,ArrayType(StringType,true),true)

scala> val caseOneExpr = from_json(caseOneJsonMap,caseOneSchema)
caseOneExpr: org.apache.spark.sql.Column = entries

Case-1 Final Output

scala> dfa
.withColumn("data",caseOneExpr)
.withColumn("col3",when(expr("array_contains(data[col1],col2)"),concat(lit("x-"),$"col3")).otherwise($"col3"))
.drop("data")
.show(false)

+----+----+----+
|col1|col2|col3|
+----+----+----+
|u1  |w1  |x-v1|
|u2  |w2  |x-v2|
|u3  |w3  |v3  |
+----+----+----+

Case-2 Logic

scala> val caseTwoDF = Seq(("u1",2,"v1"),("u1",6,"v11"),("u2",3,"v3"),("u3",4,"v3")).toDF("col1","col2","col3")
caseTwoDF: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field]

scala> val caseTwoMap = Map("u1" -> Seq(1,5),"u2" -> Seq(2,4))
caseTwoMap: scala.collection.immutable.Map[String,Seq[Int]] = Map(u1 -> List(1, 5), u2 -> List(2, 4))

scala> val caseTwoJsonMap = lit(compact(render(caseTwoMap)))
caseTwoJsonMap: org.apache.spark.sql.Column = {"u1":[1,5],"u2":[2,4]}

scala> val caseTwoSchema = MapType(StringType,ArrayType(IntegerType))
caseTwoSchema: org.apache.spark.sql.types.MapType = MapType(StringType,ArrayType(IntegerType,true),true)

scala> val caseTwoExpr = from_json(caseTwoJsonMap,caseTwoSchema)
caseTwoExpr: org.apache.spark.sql.Column = entries

Case-2 Final Output

scala> caseTwoDF
.withColumn("data",caseTwoExpr)
.withColumn("col3",when(expr("array_contains(sequence(data[col1][0],data[col1][1]),col2)"), concat(lit("x-"),$"col3")).otherwise($"col3"))
.drop("data")
.show(false)

+----+----+----+
|col1|col2|col3|
+----+----+----+
|u1  |2   |x-v1|
|u1  |6   |v11 |
|u2  |3   |x-v3|
|u3  |4   |v3  |
+----+----+----+

Transform columns in Spark DataFrame based on map without using UDFs

3 Answers

Case-1

Case-2