
I want to fill nan values in spark conditionally (to make sure I considered each corner case of my data and not by simply filling anything with a replacement value).

A sample could look like

case class FooBar(foo: String, bar: String)

import spark.implicits._ // needed for toDF and as[FooBar]
val myDf = Seq(("a", "first"), ("b", "second"), ("c", null), ("third", "fooBar"), ("someMore", null))
  .toDF("foo", "bar")
  .as[FooBar]

+--------+------+
|     foo|   bar|
+--------+------+
|       a| first|
|       b|second|
|       c|  null|
|   third|fooBar|
|someMore|  null|
+--------+------+

Unfortunately

    myDf
        .withColumn(
          "bar",
          when(
            (($"foo" === "c") and ($"bar" isNull)) , "someReplacement" 
          )
        ).show

sets all the other, non-matching values in the column to null:

+--------+---------------+
|     foo|            bar|
+--------+---------------+
|       a|           null|
|       b|           null|
|       c|someReplacement|
|   third|           null|
|someMore|           null|
+--------+---------------+

and

myDf
    .withColumn(
      "bar",
      when(
        (($"foo" === "c") and ($"bar" isNull)) or
        (($"foo" === "someMore") and ($"bar" isNull)), "someReplacement" 
      )
    ).show

which I would really want to use to fill in the values for different classes / categories of foo, does not work either.

I am curious how to fix this.


1 Answer


Use otherwise:

when(
  (($"foo" === "c") and ($"bar" isNull)) or
  (($"foo" === "someMore") and ($"bar" isNull)), "someReplacement" 
).otherwise($"bar")
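Applied end-to-end to the sample data, the fix might look like this (a sketch; the local `SparkSession` setup is an assumption for illustration, not part of the original post):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

// Assumed local session, just so the example is self-contained.
val spark = SparkSession.builder().master("local[1]").appName("conditionalFill").getOrCreate()
import spark.implicits._

val myDf = Seq(("a", "first"), ("b", "second"), ("c", null), ("third", "fooBar"), ("someMore", null))
  .toDF("foo", "bar")

val fixed = myDf.withColumn(
  "bar",
  when((($"foo" === "c") and ($"bar".isNull)) or
       (($"foo" === "someMore") and ($"bar".isNull)), "someReplacement")
    .otherwise($"bar") // keep the existing value for every non-matching row
)
```

Here only the `c` and `someMore` rows get `someReplacement`; `first`, `second`, and `fooBar` survive untouched.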

or coalesce:

coalesce(
  $"bar",  
  when(($"foo" === "c") or ($"foo" === "someMore"), "someReplacement")
)

The advantage of coalesce is less typing: you don't have to repeat the $"bar" isNull check for every condition.
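The coalesce variant applied the same way should give the same result, since coalesce returns the first non-null argument per row (again a sketch, with the `SparkSession` setup assumed):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, when}

// Assumed local session for illustration.
val spark = SparkSession.builder().master("local[1]").appName("coalesceFill").getOrCreate()
import spark.implicits._

val myDf = Seq(("a", "first"), ("b", "second"), ("c", null), ("third", "fooBar"), ("someMore", null))
  .toDF("foo", "bar")

// Per row: keep the original bar if it is non-null,
// otherwise fall back to the conditional replacement
// (which is itself null for rows no when-branch matches).
val fixed = myDf.withColumn(
  "bar",
  coalesce(
    $"bar",
    when(($"foo" === "c") or ($"foo" === "someMore"), "someReplacement")
  )
)
```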