What I want here is to replace the value in a specific column with null when it is an empty string. The reason is that I am using org.apache.spark.sql.functions.coalesce to fill one of the DataFrame's columns based on other columns, but I have noticed that in some rows the value is an empty string instead of null, so the coalesce function doesn't work as expected.
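To make the problem concrete, here is a minimal self-contained sketch (using a local SparkSession and a toy two-column DataFrame, not the real data from my job) showing that coalesce only skips nulls, not empty strings:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.coalesce

val spark = SparkSession.builder().master("local[1]").appName("coalesce-demo").getOrCreate()
import spark.implicits._

// One row where xx is "" and one where xx is null; yy is the fallback.
val df = Seq[(String, String)](("", "fallback"), (null, "fallback")).toDF("xx", "yy")

val out = df.select(coalesce($"xx", $"yy").as("resolved"))
// The "" row resolves to "" (coalesce treats it as non-null);
// only the null row falls through to "fallback".
out.show()
```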
val myCoalesceColumnorder: Seq[String] = Seq("xx", "yy", "zz")
val resolvedDf = df.select(
  df("a"),
  df("b"),
  lower(org.apache.spark.sql.functions.coalesce(myCoalesceColumnorder.map(x => adjust(x)): _*)).as("resolved_id")
)
In the above example, I expected resolved_id to be filled first from column xx if it is not null, then from column yy if xx is null, and so on. But since column xx is sometimes filled with "" instead of null, I get "" in resolved_id.
I have tried to fix it with
resolvedDf.na.replace("resolved_id", Map("" -> null))
But based on the na.replace documentation, it only works if both the key and the value are Boolean, String, or Double, so I can not use null here.
I don't want to use a UDF because of the performance cost. Is there any other trick to solve this issue?
One other way I can fix this is by using when, but I am not sure about the performance:
resolvedDf
.withColumn("resolved_id", when(col("resolved_id").equalTo(""), null).otherwise(col("resolved_id")))
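For reference, a related trick I am considering (sketched here on a toy DataFrame standing in for the real one, with the question's adjust helper omitted for simplicity): when without an otherwise yields null wherever the condition does not hold, so the empty strings can be nulled out inside the coalesce chain itself, and this stays a native Catalyst expression just like the when/otherwise version:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lower, when}

val spark = SparkSession.builder().master("local[1]").appName("coalesce-trick").getOrCreate()
import spark.implicits._

// Toy stand-in for the question's DataFrame: xx is sometimes "" instead of null.
val df = Seq[(String, String, String)](("", "YY1", "zz1"), (null, null, "ZZ2"))
  .toDF("xx", "yy", "zz")
val myCoalesceColumnorder: Seq[String] = Seq("xx", "yy", "zz")

// `when` without `otherwise` returns null where the condition is false or null,
// so "" values fall through to the next column in the coalesce order.
val resolved = df.select(
  lower(coalesce(myCoalesceColumnorder.map(c => when(col(c) =!= "", col(c))): _*)).as("resolved_id")
)
// First row resolves from yy ("yy1"), second row from zz ("zz2").
```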