Spark: how to process certain column content individually in dataframe?

Question

The data structure is like this:

id	name	data
001	aaa	true,false,false
002	bbb	true,true,true
003	ccc	false,true,true

I want to map the results in data to their names by their corresponding orders in the mapping table. In detail, the first step is to get the order number of False in data and then get the name by order number in the mapping table.

For example, the first record has two False and their index numbers are 2 and 3, so the mapping result is code2 and code3. Also, there are all true in the second record so the mapping result is an empty string.

the mapping table: ("code1","code2","code3")

the expected result:

id	name	data
001	aaa	code2,code3
002	bbb
003	ccc	code1

Is it possible to achieve this in the dataframe?

can you elaborate how true,false,false maps to code2,code3 and true,true,true maps to nothing? — linusRian
@linusRian Sorry for forgetting to mention the mapping logic. The details are already updated. — cnidaye
cnidaye, thank you ,sorry for the delay, I can see that this has already been answered :) — linusRian

koiralo koiralo · Accepted Answer · 2021-07-09T05:48:32

If you are using spark 3+ you can use filter and transform functions as

val df = Seq(
  ("001", "aaa", "true,false,false"),
  ("002", "bbb", "true,true,true"),
  ("003", "ccc", "false,true,true"),
).toDF("id", "name", "data")

val cols = Seq("col1", "col2", "col3")

val dfNew = df.withColumn("data", split($"data", ","))
  .withColumn("mapping", arrays_zip($"data", typedLit(cols)))
  .withColumn("new1", filter($"mapping", (c: Column) => c.getField("data") === "false"))
  .withColumn("data", transform($"new1", (c: Column) => c.getField("1")))
  .drop("new1", "mapping")

dfNew.show(false)

Output:

+---+----+------------+
|id |name|data        |
+---+----+------------+
|001|aaa |[col2, col3]|
|002|bbb |[]          |
|003|ccc |[col1]      |
+---+----+------------+

Spark: how to process certain column content individually in dataframe?

2 Answers