3
votes

For pandas I have a code snippet like this:

def setUnknownCatValueConditional(df, conditionCol, condition, colToSet, _valueToSet='KEINE'):
    df.loc[(df[conditionCol] == condition) & (df[colToSet].isnull()), colToSet] = _valueToSet

which conditionally will replace values in a data frame.

Trying to port this functionality to spark

df.withColumn("A", when($"A" === "x" and $"B" isNull, "replacement")).show

Did not work out for me

df.withColumn("A", when($"A" === "x" and $"B" isNull, "replacement")).show
warning: there was one feature warning; re-run with -feature for details
org.apache.spark.sql.AnalysisException: cannot resolve '((`A` = 'x') AND `B`)' due to data type mismatch: differing types in '((`A` = 'X') AND `B`)' (boolean and string).;;

even though df.printSchema returns a string for A and b

What is wrong here?

edit

A minimal example:

import java.sql.{ Date, Timestamp }
case class FooBar(foo:Date, bar:String)
val myDf = Seq(("2016-01-01","first"),("2016-01-02","second"),("2016-wrongFormat","noValidFormat"), ("2016-01-04","lastAssumingSameDate"))
         .toDF("foo","bar")
         .withColumn("foo", 'foo.cast("Date"))
         .as[FooBar]

myDf.printSchema
root
 |-- foo: date (nullable = true)
 |-- bar: string (nullable = true)


scala> myDf.show
+----------+--------------------+
|       foo|                 bar|
+----------+--------------------+
|2016-01-01|               first|
|2016-01-02|              second|
|      null|       noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+

myDf.withColumn("foo", when($"bar" === "noValidFormat" and $"foo" isNull, "noValue")).show

And the expected output

+----------+--------------------+
|       foo|                 bar|
+----------+--------------------+
|2016-01-01|               first|
|2016-01-02|              second|
| "noValue"|       noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+

edit2

in case chaining of conditions is required

df
    .withColumn("A",
      when(
        (($"B" === "x") and ($"B" isNull)) or
        (($"B" === "y") and ($"B" isNull)), "replacement") 

should work

1
please share example data and expected output - mtoto
@mtoto please see the edit. - Georg Heiler

1 Answers

3
votes

Mind the operator precedence. It should be:

myDf.withColumn("foo",
  when(($"bar" === "noValidFormat") and ($"foo" isNull), "noValue"))

This:

$"bar" === "noValidFormat" and $"foo" isNull

is evaluated as:

(($"bar" === "noValidFormat") and $"foo") isNull