
I have a DataFrame of n columns, and I would like to count the number of missing values in each column.

I use the following snippet to do this, but the output isn't what I expect:

for (e <- df.columns) {
    var c: Int = df.filter( df(e).isNull || df(e) === "" || df(e).isNaN || 
                            df(e) === "-" || df(e) === "NA").count()
    println(e+":"+c)
}

Output:

column1:
column2:
column3:

How can I correctly count the missing values in each column, based on the logic in the snippet?


2 Answers


You can do it in a slightly different way: build one aggregate expression per column and evaluate them all in a single select.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF; assumes `spark` is your SparkSession

// A value counts as missing if it is null, NaN, or one of the sentinel strings.
def predicate(c: Column): Column =
  c.isNull || c === "" || c.isNaN || c === "-" || c === "NA"

val df = List[(Integer, Integer, Integer)]((1, null, null), (null, 2, 3), (null, 3, null))
  .toDF("a", "b", "c")

// count() skips only NULLs, and `false` is not NULL, so wrap the predicate in
// when(...) to map non-matching rows to NULL before counting.
df.select(df.columns.map(c => count(when(predicate(col(c)), 1)).as(s"nulls in column $c")): _*).show()

This code will produce:

+-----------------+-----------------+-----------------+
|nulls in column a|nulls in column b|nulls in column c|
+-----------------+-----------------+-----------------+
|                2|                1|                2|
+-----------------+-----------------+-----------------+
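
If you want the counts on the driver as a plain Scala value rather than a DataFrame, here is a minimal sketch reusing the `df` and `predicate` above (the Map shape is my addition, not part of the original answer):

// Run all the counters in one job, then zip the single result row with the column names.
val counts: Map[String, Long] =
  df.columns.zip(
    df.select(df.columns.map(c => count(when(predicate(col(c)), 1))): _*)
      .head().toSeq.map(_.asInstanceOf[Long])).toMap
// counts: Map(a -> 2, b -> 1, c -> 2)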
Another option is to build a one-row DataFrame of missing-value counts per column and union the results:

import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF; assumes `spark` is your SparkSession

val df = List[(Integer, Integer)]((1, null), (null, 2), (null, 3)).toDF("name", "rank")
df.show
+----+----+
|name|rank|
+----+----+
|   1|null|
|null|   2|
|null|   3|
+----+----+

// Build one single-row aggregate per column, then union them.
// (Don't name the column array `col`: that shadows functions.col.)
val dfArray = df.columns.map(c => df.select(lit(c).as("colName"),
  sum(when(df(c).isNull || df(c) === "" || df(c) === "-" || df(c).isNaN, 1).otherwise(0)).as("missingValues")))
dfArray.reduce(_ union _).show
// output:
+-------+-------------+
|colName|missingValues|
+-------+-------------+
|   name|            2|
|   rank|            1|
+-------+-------------+
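
Note that this scans the input once per column. The same numbers can be computed in a single pass; a sketch using the same `df` as above:

// Aggregate every column in one job, then print name/count pairs on the driver.
val row = df.select(df.columns.map(c =>
  sum(when(df(c).isNull || df(c) === "" || df(c) === "-" || df(c).isNaN, 1).otherwise(0)).as(c)): _*).head()
df.columns.zip(row.toSeq).foreach { case (name, n) => println(s"$name: $n") }
// name: 2
// rank: 1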