
I have a DataFrame of n columns, and I would like to count the number of missing values in each column.

I use the following snippet to do this, but the output isn't what I expect:

for (e <- df.columns) {
    var c: Int = df.filter( df(e).isNull || df(e) === "" || df(e).isNaN || 
                            df(e) === "-" || df(e) === "NA").count()
    println(e+":"+c)
}

Output:

column1:
column2:
column3:

How can I correctly count the missing values in each column, based on the logic in the snippet?


2 Answers


You can do it in a slightly different way: build one aggregate expression per column and evaluate them all in a single select.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF; assumes `spark` is your SparkSession

// A value counts as missing if it is null, NaN, or one of the sentinel strings.
def predicate(c: Column): Column =
  c.isNull || c === "" || c.isNaN || c === "-" || c === "NA"

val df = List[(Integer, Integer, Integer)]((1, null, null), (null, 2, 3), (null, 3, null))
  .toDF("a", "b", "c")

// count() skips only NULLs, and `false` is not NULL, so wrap the predicate in
// when(...) to map non-matching rows to NULL before counting.
df.select(df.columns.map(c => count(when(predicate(col(c)), 1)).as(s"nulls in column $c")): _*).show()

This code will produce:

+-----------------+-----------------+-----------------+
|nulls in column a|nulls in column b|nulls in column c|
+-----------------+-----------------+-----------------+
|                2|                1|                2|
+-----------------+-----------------+-----------------+
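
If you want the counts on the driver as a plain Scala value rather than a DataFrame, here is a minimal sketch reusing the `df` and `predicate` above (the Map shape is my addition, not part of the original answer):

// Run all the counters in one job, then zip the single result row with the column names.
val counts: Map[String, Long] =
  df.columns.zip(
    df.select(df.columns.map(c => count(when(predicate(col(c)), 1))): _*)
      .head().toSeq.map(_.asInstanceOf[Long])).toMap
// counts: Map(a -> 2, b -> 1, c -> 2)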
Another option is to build a one-row DataFrame of missing-value counts per column and union the results:

import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF; assumes `spark` is your SparkSession

val df = List[(Integer, Integer)]((1, null), (null, 2), (null, 3)).toDF("name", "rank")
df.show
+----+----+
|name|rank|
+----+----+
|   1|null|
|null|   2|
|null|   3|
+----+----+

// Build one single-row aggregate per column, then union them.
// (Don't name the column array `col`: that shadows functions.col.)
val dfArray = df.columns.map(c => df.select(lit(c).as("colName"),
  sum(when(df(c).isNull || df(c) === "" || df(c) === "-" || df(c).isNaN, 1).otherwise(0)).as("missingValues")))
dfArray.reduce(_ union _).show
// output:
+-------+-------------+
|colName|missingValues|
+-------+-------------+
|   name|            2|
|   rank|            1|
+-------+-------------+
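
Note that this scans the input once per column. The same numbers can be computed in a single pass; a sketch using the same `df` as above:

// Aggregate every column in one job, then print name/count pairs on the driver.
val row = df.select(df.columns.map(c =>
  sum(when(df(c).isNull || df(c) === "" || df(c) === "-" || df(c).isNaN, 1).otherwise(0)).as(c)): _*).head()
df.columns.zip(row.toSeq).foreach { case (name, n) => println(s"$name: $n") }
// name: 2
// rank: 1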