1
votes

I would like to transform some columns in my dataframe based on configuration represented by Scala maps.

I have 2 case:

  1. Receiving a map Map[String, Seq[String]] and columns col1, col2, to transform col3 if there is an entity in a map with key = col1, and col2 is in this entity value list.
  2. Receiving a map Map[String, (Long, Long) and col1, col2, to transform col3 if there is an entity in a map with key = col1 and col2 is in a range describe by the tuple of Longs as (start, end).

examples:

case 1 having this table, and a map Map(u1-> Seq(w1,w11), u2 -> Seq(w2,w22))

+------+------+------+
| col1 | col2 | col3 | 
+------+------+------+
| u1   | w1   | v1   |
+------+------+------+
| u2   | w2   | v2   |
+------+------+------+
| u3   | w3   | v3   |
+------+------+------+

I would like to add "x-" prefix to col3, only if it matchs the term

+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1   | w1   | x-v1 |
+------+------+------+
| u2   | w2   | x-v2 |
+------+------+------+
| u3   | w3   | v3   |
+------+------+------+

case 2: This table and map Map("u1" -> (1,5), u2 -> (2, 4))

+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1   | 2    | v1   |
+------+------+------+
| u1   | 6    | v11  |
+------+------+------+
| u2   | 3    | v3   |
+------+------+------+
| u3   | 4    | v3   |
+------+------+------+

expected output should be:

+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1   | 2    | x-v1 |
+------+------+------+
| u1   | 6    | v11  |
+------+------+------+
| u2   | 3    | x-v3 |
+------+------+------+
| u3   | 4    | v3   |
+------+------+------+

This can easily be done by UDFs, but for performance concerned, I would like not to use them.

Is there a way to achieve it without it in Spark 2.4.2?

Thanks

3
can you also add sample input & expected output ?? & what spark version ??Srinivas
@Srinivas examples added, thanksLeonB
spark version ??Srinivas
@Srinivas Spark 2.4.2LeonB
is it ok, if i convert this Map("u1" -> (1,5), u2 -> (2, 4)) to Map("u1" -> Seq(1,5), u2 -> Seq(2, 4)) ?Srinivas

3 Answers

2
votes

Check below code.

Note -

  • I have changed your second case map value to Map("u1" -> Seq(1,5), u2 -> Seq(2, 4))
  • Converting map values to json map, adding json map as column values to DataFrame, then applying logic on DataFrame.
  • If possible you can directly add values inside json map so that you can avoid conversion map to json map.

Import required libraries.

import org.apache.spark.sql.types._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

Case-1 Logic

scala> val caseOneDF = Seq(("u1","w1","v1"),("u2","w2","v2"),("u3","w3","v3")).toDF("col1","col2","col3")
caseOneDF: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more field]
scala> val caseOneMap = Map("u1" -> Seq("w1","w11"),"u2" -> Seq("w2","w22"))
caseOneMap: scala.collection.immutable.Map[String,Seq[String]] = Map(u1 -> List(w1, w11), u2 -> List(w2, w22))
scala> val caseOneJsonMap = lit(compact(render(caseOneMap)))
caseOneJsonMap: org.apache.spark.sql.Column = {"u1":["w1","w11"],"u2":["w2","w22"]}
scala> val caseOneSchema = MapType(StringType,ArrayType(StringType))
caseOneSchema: org.apache.spark.sql.types.MapType = MapType(StringType,ArrayType(StringType,true),true)
scala> val caseOneExpr = from_json(caseOneJsonMap,caseOneSchema)
caseOneExpr: org.apache.spark.sql.Column = entries

Case-1 Final Output

scala> dfa
.withColumn("data",caseOneExpr)
.withColumn("col3",when(expr("array_contains(data[col1],col2)"),concat(lit("x-"),$"col3")).otherwise($"col3"))
.drop("data")
.show(false)

+----+----+----+
|col1|col2|col3|
+----+----+----+
|u1  |w1  |x-v1|
|u2  |w2  |x-v2|
|u3  |w3  |v3  |
+----+----+----+

Case-2 Logic

scala> val caseTwoDF = Seq(("u1",2,"v1"),("u1",6,"v11"),("u2",3,"v3"),("u3",4,"v3")).toDF("col1","col2","col3")
caseTwoDF: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field]
scala> val caseTwoMap = Map("u1" -> Seq(1,5),"u2" -> Seq(2,4))
caseTwoMap: scala.collection.immutable.Map[String,Seq[Int]] = Map(u1 -> List(1, 5), u2 -> List(2, 4))
scala> val caseTwoJsonMap = lit(compact(render(caseTwoMap)))
caseTwoJsonMap: org.apache.spark.sql.Column = {"u1":[1,5],"u2":[2,4]}
scala> val caseTwoSchema = MapType(StringType,ArrayType(IntegerType))
caseTwoSchema: org.apache.spark.sql.types.MapType = MapType(StringType,ArrayType(IntegerType,true),true)
scala> val caseTwoExpr = from_json(caseTwoJsonMap,caseTwoSchema)
caseTwoExpr: org.apache.spark.sql.Column = entries

Case-2 Final Output

scala> caseTwoDF
.withColumn("data",caseTwoExpr)
.withColumn("col3",when(expr("array_contains(sequence(data[col1][0],data[col1][1]),col2)"), concat(lit("x-"),$"col3")).otherwise($"col3"))
.drop("data")
.show(false)

+----+----+----+
|col1|col2|col3|
+----+----+----+
|u1  |2   |x-v1|
|u1  |6   |v11 |
|u2  |3   |x-v3|
|u3  |4   |v3  |
+----+----+----+
2
votes

Another alternative -

import org.apache.spark.sql.functions.typedLit

Case-1

df1.show(false)
    df1.printSchema()
    /**
      * +----+----+----+
      * |col1|col2|col3|
      * +----+----+----+
      * |u1  |w1  |v1  |
      * |u2  |w2  |v2  |
      * |u3  |w3  |v3  |
      * +----+----+----+
      *
      * root
      * |-- col1: string (nullable = true)
      * |-- col2: string (nullable = true)
      * |-- col3: string (nullable = true)
      */
 val case1 = Map("u1" -> Seq("w1","w11"), "u2" -> Seq("w2","w22"))

    val p1 = df1.withColumn("case1", typedLit(case1))
      .withColumn("col3",
        when(array_contains(expr("case1[col1]"), $"col2"), concat(lit("x-"), $"col3"))
          .otherwise($"col3")
      )
    p1.show(false)
    p1.printSchema()
    /**
      * +----+----+----+----------------------------------+
      * |col1|col2|col3|case1                             |
      * +----+----+----+----------------------------------+
      * |u1  |w1  |x-v1|[u1 -> [w1, w11], u2 -> [w2, w22]]|
      * |u2  |w2  |x-v2|[u1 -> [w1, w11], u2 -> [w2, w22]]|
      * |u3  |w3  |v3  |[u1 -> [w1, w11], u2 -> [w2, w22]]|
      * +----+----+----+----------------------------------+
      *
      * root
      * |-- col1: string (nullable = true)
      * |-- col2: string (nullable = true)
      * |-- col3: string (nullable = true)
      * |-- case1: map (nullable = false)
      * |    |-- key: string
      * |    |-- value: array (valueContainsNull = true)
      * |    |    |-- element: string (containsNull = true)
      */

Case-2

df2.show(false)
    df2.printSchema()
    /**
      * +----+----+----+
      * |col1|col2|col3|
      * +----+----+----+
      * |u1  |2   |v1  |
      * |u1  |6   |v11 |
      * |u2  |3   |v3  |
      * |u3  |4   |v3  |
      * +----+----+----+
      *
      * root
      * |-- col1: string (nullable = true)
      * |-- col2: integer (nullable = true)
      * |-- col3: string (nullable = true)
      */
val case2 = Map("u1" -> (1,5), "u2" -> (2, 4))
    val p = df2.withColumn("case2", typedLit(case2))
      .withColumn("col3",
        when(expr("col2 between case2[col1]._1 and case2[col1]._2"), concat(lit("x-"), $"col3"))
          .otherwise($"col3")
      )
    p.show(false)
    p.printSchema()

    /**
      * +----+----+----+----------------------------+
      * |col1|col2|col3|case2                       |
      * +----+----+----+----------------------------+
      * |u1  |2   |x-v1|[u1 -> [1, 5], u2 -> [2, 4]]|
      * |u1  |6   |v11 |[u1 -> [1, 5], u2 -> [2, 4]]|
      * |u2  |3   |x-v3|[u1 -> [1, 5], u2 -> [2, 4]]|
      * |u3  |4   |v3  |[u1 -> [1, 5], u2 -> [2, 4]]|
      * +----+----+----+----------------------------+
      *
      * root
      * |-- col1: string (nullable = true)
      * |-- col2: integer (nullable = true)
      * |-- col3: string (nullable = true)
      * |-- case2: map (nullable = false)
      * |    |-- key: string
      * |    |-- value: struct (valueContainsNull = true)
      * |    |    |-- _1: integer (nullable = false)
      * |    |    |-- _2: integer (nullable = false)
      */
0
votes
scala> caseTwoDF
.withColumn("data",caseTwoExpr)
.withColumn("col3",when(expr("array_contains(sequence(data[col1][0],data[col1][1]),col2)"), concat(lit("x-"),$"col3")).otherwise($"col3"))
.drop("data")
.show(false)
+----+----+----+
|col1|col2|col3|
+----+----+----+
|u1  |2   |x-v1|
|u1  |6   |v11 |
|u2  |3   |x-v3|
|u3  |4   |v3  |
+----+----+----+