2
votes

I am using Spark/Scala and I want to fill the nulls in my DataFrame with default values based on the type of the columns.

For example: String columns -> "string", numeric columns -> 111, Boolean columns -> false, etc.

Currently the DataFrameNaFunctions API (df.na) provides fill(valueMap: Map[String, Any]), for example:

df.na.fill(Map(
    "A" -> "unknown",
    "B" -> 1.0
))

This requires knowing the column names and also the type of the columns.

OR

fill(value: String, cols: Seq[String])

This form takes only a String or Double value, not even a Boolean.

Is there a smart way to do this?

You might need to use isInstanceOf to check the incoming data type and replace with proper value. - Shankar
Please provide a reproducible example. - mtoto
Thanks for the help. I used pattern matching to find the type, created a map, and used it. - Vijeth Hegde
Unfortunately even Spark v2.2.1 supports only a limited number of datatypes for DataFrame.na.fill operation. Quoting the docs, "value must be of the following type: Int, Long, Float, Double, String, Boolean." - y2k-shubham

1 Answer

7
votes

Take a look at dtypes: Array[(String, String)]. You can use the output of this method to generate a Map for fill, e.g.:

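// Build a column -> default map from the schema; collect skips column types not listed here.
val typeMap: Map[String, Any] = df.dtypes.collect {
    case (name, "IntegerType") => name -> 0
    case (name, "StringType")  => name -> ""
    case (name, "DoubleType")  => name -> 0.0
}.toMap

val filled = df.na.fill(typeMap)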
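
A slightly fuller sketch of the same idea, using the defaults from the question and including Boolean. The names TypeDefaults, defaultFor and fillMap are made up for illustration and are not part of Spark. The helper works on the plain Array[(String, String)] that dtypes returns, so it can be tried without a Spark session. It assumes that type names not listed here (dates, decimals, nested types, etc.) should simply be left untouched.

```scala
object TypeDefaults {
  // Hypothetical helper: map a dtype name (as reported by df.dtypes) to a default value.
  // Only value types that na.fill accepts are used: Int, Long, Float, Double, String, Boolean.
  def defaultFor(dtype: String): Option[Any] = dtype match {
    case "StringType"  => Some("string")
    case "IntegerType" => Some(111)
    case "LongType"    => Some(111L)
    case "FloatType"   => Some(111.0f)
    case "DoubleType"  => Some(111.0)
    case "BooleanType" => Some(false)
    case _             => None // assumption: leave every other column type alone
  }

  // Build the column -> default map, skipping columns without a default.
  def fillMap(dtypes: Array[(String, String)]): Map[String, Any] =
    (for {
      (name, dtype) <- dtypes.toSeq
      default       <- defaultFor(dtype).toSeq
    } yield name -> default).toMap
}
```

With a DataFrame df this would be applied as df.na.fill(TypeDefaults.fillMap(df.dtypes)). Returning an Option per type keeps the match total, so an unexpected column type is skipped rather than raising a MatchError.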