74 votes
import numpy as np

data = [
    (1, 1, None), 
    (1, 2, float(5)), 
    (1, 3, np.nan), 
    (1, 4, None), 
    (1, 5, float(10)), 
    (1, 6, float("nan")), 
    (1, 6, float("nan")),
]
df = spark.createDataFrame(data, ("session", "timestamp1", "id2"))

Expected output

A dataframe with the count of NaN/null values for each column.
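For the sample data above (two Nones and three NaNs in id2), that would be something like:

+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  5|
+-------+----------+---+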

Note: The previous questions I found on Stack Overflow only check for null, not NaN. That's why I have created a new question.

I know I can use the isnull() function in Spark to find the number of null values in a Spark column, but how do I find NaN values in a Spark dataframe?


6 Answers

151 votes

You can use the method shown here and replace isNull with isnan:

from pyspark.sql.functions import isnan, when, count, col

df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  3|
+-------+----------+---+

or

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  5|
+-------+----------+---+
9 votes

To make sure it does not fail for string, date and timestamp columns:

import pyspark.sql.functions as F
def count_missings(spark_df,sort=True):
    """
    Counts the number of nulls and NaNs in each numeric column.
    String, date and timestamp columns are skipped, since isnan is not defined for them.
    """
    df = spark_df.select([
        F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c)
        for (c, c_type) in spark_df.dtypes
        if c_type not in ('timestamp', 'string', 'date')
    ]).toPandas()

    if len(df) == 0:
        print("There are no any missing values!")
        return None

    if sort:
        return df.rename(index={0: 'count'}).T.sort_values("count",ascending=False)

    return df

If you want to see the columns sorted by the number of NaNs and nulls, in descending order:

count_missings(spark_df)

# | Col_A | 10 |
# | Col_C | 2  |
# | Col_B | 1  | 

If you don't want ordering and prefer to see them as a single row:

count_missings(spark_df, False)
# | Col_A | Col_B | Col_C |
# |  10   |   1   |   2   |
6 votes

For null values in a PySpark dataframe:

Dict_Null = {col:df.filter(df[col].isNull()).count() for col in df.columns}
Dict_Null

# The output is a dict where each key is a column name and the value is the number of null values in that column

{'#': 0,
 'Name': 0,
 'Type 1': 0,
 'Type 2': 386,
 'Total': 0,
 'HP': 0,
 'Attack': 0,
 'Defense': 0,
 'Sp_Atk': 0,
 'Sp_Def': 0,
 'Speed': 0,
 'Generation': 0,
 'Legendary': 0}
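If you also want NaNs included in the same dict, the comprehension can be extended like this (a sketch that assumes the columns are numeric, since isnan is not defined for string, date or timestamp columns; Dict_Missing is just an illustrative name):

from pyspark.sql.functions import isnan

# Per column: number of rows that are either null or NaN (numeric columns only)
Dict_Missing = {col: df.filter(df[col].isNull() | isnan(df[col])).count() for col in df.columns}
Dict_Missing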
4 votes

Here is my one-liner, where 'c' is the name of the column:

import pyspark.sql.functions as F

df.select('c').withColumn('isNull_c', F.col('c').isNull()).where('isNull_c = True').count()
1 vote

An alternative to the already provided ways is to simply filter on the column, like so:

import pyspark.sql.functions as F
df = df.where(F.col('columnNameHere').isNull())

This has the added benefit that you don't have to add another column to do the filtering and it's quick on larger data sets.
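If NaNs should be caught by the same filter, the check can be extended in the same way (a sketch; isnan assumes 'columnNameHere' is a numeric column):

import pyspark.sql.functions as F

# Keep rows where the column is null or NaN, then count them
df.where(F.col('columnNameHere').isNull() | F.isnan('columnNameHere')).count()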

0 votes

I prefer this solution:

from pyspark.sql.functions import count

# selected_table and condition are placeholders for your own table and filter
df = spark.table(selected_table).filter(condition)

counter = df.count()

# count(c) only counts non-null values, so counter - count(c) is the null count per column
df = df.select([(counter - count(c)).alias(c) for c in df.columns])