
I'm trying to fill empty values with null when I split a column in Spark. Example input:

| A     |
|-------|
| 1.2.3 |
| 4..5  |
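
For reference, a minimal DataFrame matching this input could be built like this (a sketch; the `spark` session setup is an assumption):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Sample data matching the example above
val df = Seq("1.2.3", "4..5").toDF("A")
```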

I was looking for:

| A     | A split 1 | A split 2 | A split 3 |
|-------|-----------|-----------|-----------|
| 1.2.3 | 1         | 2         | 3         |
| 4..5  | 4         | null      | 5         |

I got:

| A     | A split 1 | A split 2 | A split 3 |
|-------|-----------|-----------|-----------|
| 1.2.3 | 1         | 2         | 3         |
| 4..5  | 4         | 5         |           |

My code is:

```scala
df.withColumn("A", when(split(col("A"), "\\.") =!= lit(""), split(col("A"), "\\.")))
```

However, I got an error due to a type mismatch, since `split` returns an `array<string>` that cannot be compared to a string literal:

`array(string) is not a string`

Is it possible to find a solution without using a UDF?

Many thanks

Comment from jrook: Assuming the split part is resolved, do you want to create new columns from the arrays, or just replace the empty strings with null inside the split array?

2 Answers

1 vote

You can split the column, then when extracting the array items as columns, use `when` to turn an empty element into null:

```scala
import org.apache.spark.sql.functions._

// n is the max array size from split (in this example it's 3)
val n = 3

val df1 = df.withColumn(
    "ASplit",
    split(col("A"), "[.]")
  ).select(
    Seq(col("A")) ++ (0 until n).map(i =>
      when(col("ASplit")(i) === "", lit(null)).otherwise(col("ASplit")(i)).as(s"A split $i")
    ): _*
  )

//+-----+---------+---------+---------+
//|    A|A split 0|A split 1|A split 2|
//+-----+---------+---------+---------+
//|1.2.3|        1|        2|        3|
//| 4..5|        4|     null|        5|
//+-----+---------+---------+---------+
```
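
If the maximum number of split parts isn't known up front, one option is to compute `n` from the data first. A sketch, at the cost of an extra aggregation job (`withSplit` is just an intermediate name used here):

```scala
import org.apache.spark.sql.functions._

// Compute the largest split size present in the data, then expand to that
// many columns (requires an extra pass over the DataFrame)
val withSplit = df.withColumn("ASplit", split(col("A"), "[.]"))
val n = withSplit.select(max(size(col("ASplit")))).first.getInt(0)

val df1 = withSplit.select(
  Seq(col("A")) ++ (0 until n).map(i =>
    when(col("ASplit")(i) === "", lit(null)).otherwise(col("ASplit")(i)).as(s"A split $i")
  ): _*
)
```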
1 vote

You can use the `transform` higher-order function to rewrite the split result, replacing empty values with null:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val result = df.withColumn(
    "split",
    expr("transform(split(A, '\\\\.'), x -> case when x = '' then null else x end)")
).select($"A", $"split"(0), $"split"(1), $"split"(2))

result.show
```

```
+-----+--------+--------+--------+
|    A|split[0]|split[1]|split[2]|
+-----+--------+--------+--------+
|1.2.3|       1|       2|       3|
| 4..5|       4|    null|       5|
+-----+--------+--------+--------+
```
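
On Spark 3.0+, the same idea can be written with the typed `transform` function from `org.apache.spark.sql.functions` instead of an `expr` string. A sketch, assuming a 3.x runtime:

```scala
import org.apache.spark.sql.functions._

// Same logic as above, but with the Scala transform API (Spark 3.0+)
val result = df.withColumn(
    "split",
    transform(split(col("A"), "\\."), x => when(x === "", lit(null)).otherwise(x))
).select(col("A"), col("split")(0), col("split")(1), col("split")(2))
```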