1
votes

The data structure is like this:

id name data
001 aaa true,false,false
002 bbb true,true,true
003 ccc false,true,true

I want to map the results in data to their names by their corresponding orders in the mapping table. In detail, the first step is to get the order number of False in data and then get the name by order number in the mapping table.

For example, the first record has two False and their index numbers are 2 and 3, so the mapping result is code2 and code3. Also, there are all true in the second record so the mapping result is an empty string.

the mapping table: ("code1","code2","code3")

the expected result:

id name data
001 aaa code2,code3
002 bbb
003 ccc code1

Is it possible to achieve this in the dataframe?

2
can you elaborate how true,false,false maps to code2,code3 and true,true,true maps to nothing?linusRian
@linusRian Sorry for forgetting to mention the mapping logic. The details are already updated.cnidaye
cnidaye, thank you ,sorry for the delay, I can see that this has already been answered :)linusRian

2 Answers

2
votes

If you are using spark 3+ you can use filter and transform functions as

val df = Seq(
  ("001", "aaa", "true,false,false"),
  ("002", "bbb", "true,true,true"),
  ("003", "ccc", "false,true,true"),
).toDF("id", "name", "data")

val cols = Seq("col1", "col2", "col3")

val dfNew = df.withColumn("data", split($"data", ","))
  .withColumn("mapping", arrays_zip($"data", typedLit(cols)))
  .withColumn("new1", filter($"mapping", (c: Column) => c.getField("data") === "false"))
  .withColumn("data", transform($"new1", (c: Column) => c.getField("1")))
  .drop("new1", "mapping")

dfNew.show(false) 

Output:

+---+----+------------+
|id |name|data        |
+---+----+------------+
|001|aaa |[col2, col3]|
|002|bbb |[]          |
|003|ccc |[col1]      |
+---+----+------------+
1
votes

The following should work but be aware that it features a posexplode (explode an array with positional value) which can be a costly operation specially if you have a huge dataset.

val df = Seq(
("001", "aaa", "true,false,false"),
("002", "bbb", "true,true,true"),
("003", "ccc", "false,true,true")
).toDF("id", "name", "data")

val codes = Seq(
(0, "code1"),
(1, "code2"),
(2, "code3")
).toDF("code_id", "codes")


val df1 = df.select($"*", posexplode(split($"data", ",")))
  .join(codes, $"pos" === $"code_id")
  .withColumn( "codes", when($"col" === "false", $"codes").otherwise(null) )

//+---+----+----------------+---+-----+-------+-----+
//| id|name|            data|pos|  col|code_id|codes|
//+---+----+----------------+---+-----+-------+-----+
//|001| aaa|true,false,false|  0| true|      0| null|
//|001| aaa|true,false,false|  1|false|      1|code2|
//|001| aaa|true,false,false|  2|false|      2|code3|
//|002| bbb|  true,true,true|  0| true|      0| null|
//|002| bbb|  true,true,true|  1| true|      1| null|
//|002| bbb|  true,true,true|  2| true|      2| null|
//|003| ccc| false,true,true|  0|false|      0|code1|
//|003| ccc| false,true,true|  1| true|      1| null|
//|003| ccc| false,true,true|  2| true|      2| null|
//+---+----+----------------+---+-----+-------+-----+

val finalDf = df1.groupBy($"id", $"name").agg(concat_ws(",", collect_list($"codes")).as("data"))

//+---+----+-----------+
//| id|name|       data|
//+---+----+-----------+
//|002| bbb|           |
//|001| aaa|code2,code3|
//|003| ccc|      code1|
//+---+----+-----------+