2
votes

What I want here is to replace a value in a specific column to null if it's empty String.

The reason is I am using org.apache.spark.sql.functions.coalesce to fill one of the Dataframe's column based on another columns, but I have noticed in some rows the value is empty String instead of null so the coalesce function doesn't work as expected.

val myCoalesceColumnorder: Seq[String] = Seq("xx", "yy", "zz"),

val resolvedDf = df.select(
   df("a"),
   df("b"),
   lower(org.apache.spark.sql.functions.coalesce(myCoalesceColumnorder.map(x => adjust(x)): _*)).as("resolved_id")
)

In the above example, I expected to first fill resolved_id with column xx if it' not null and if it's null with column yy and so on. But since sometime column xx is filled with "" instead of null I get "" in 'resolved_id'.

I have tried to fix it with

resolvedDf.na.replace("resolved_id", Map("" -> null))

But based on the na.replace documentation it only works if both key and value are either Bolean or String or Double so I can not use null here.

I don't want to use UDF because of the performance issue, I just want to know is there any other trick to solve this issue?

One other way I can fix this is by using when but not sure about the performance

resolvedDf
      .withColumn("resolved_id", when(col("resolved_id").equalTo(""), null).otherwise(col("resolved_id")))
1
please look at this question stackoverflow.com/questions/45615621/…Raman Mishra

1 Answers

2
votes

This is the right way with better performance
resolvedDf.withColumn("resolved_id", when($"resolved_id" !== "", $"resolved_id"))

Basically no need to use otherwise method.

You can check sources::: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L507

/**
   * Evaluates a list of conditions and returns one of multiple possible result expressions.
   * If otherwise is not defined at the end, null is returned for unmatched conditions.
   *
   * {{{
   *   // Example: encoding gender string column into integer.
   *
   *   // Scala:
   *   people.select(when(people("gender") === "male", 0)
   *     .when(people("gender") === "female", 1)
   *     .otherwise(2))
   *
   *   // Java:
   *   people.select(when(col("gender").equalTo("male"), 0)
   *     .when(col("gender").equalTo("female"), 1)
   *     .otherwise(2))
   * }}}
   *
   * @group expr_ops
   * @since 1.4.0
   */
  def when(condition: Column, value: Any): Column = this.expr match {
    case CaseWhen(branches, None) =>
      withExpr { CaseWhen(branches :+ ((condition.expr, lit(value).expr))) }
    case CaseWhen(branches, Some(_)) =>
      throw new IllegalArgumentException(
        "when() cannot be applied once otherwise() is applied")
    case _ =>
      throw new IllegalArgumentException(
        "when() can only be applied on a Column previously generated by when() function")
  }