Apply describe with filter on column specific in pyspark

Question

I have a table in hive hive_tbl with columns 'col_1','col_2','col_3'. I have created a dataframe on top of above data.

Now i am getting statistics of the columns specified using describe() i am getting the result as below.

+-------+------------------+------------------+------------------+
|summary|          col1    |          col2    |   col3           |
+-------+------------------+------------------+------------------+
|  count|          17547479|          17547479|          17547479|
|   mean|2.0946498354549963| 1.474746257282603|1943.9881619448768|
| stddev|1.7921560893864912|1.2898177241581452| 40126.73218327477|
|    min|               0.0|               0.0|               0.0|
|    max|              99.0|              60.0|       1.6240624E8|
+-------+------------------+------------------+------------------+

The above count is giving the count of the records in the entire table. But can we apply filter specific to columns when using describe i.e. i have some blanks/values to be ignored when getting the count for some column eg col_1 has record count with good values 549023.

Can we get below result.

+-------+------------------+------------------+------------------+
|summary|          col1    |          col2    |   col3           |
+-------+------------------+------------------+------------------+
|  count|          549023  |            854049|          17547479|
|   mean|2.0946498354549963| 1.474746257282603|1943.9881619448768|
| stddev|1.7921560893864912|1.2898177241581452| 40126.73218327477|
|    min|               0.0|               0.0|               0.0|
|    max|              99.0|              60.0|       1.6240624E8|
+-------+------------------+------------------+------------------+

What is the condition to ignore values in the summary statistics? As far as I know, Null values are not taken into account by describe(). So it seems your DataFrame does not contain Nulls? — Florian
Filter like col_1='NAME' and get the statistics for that column likewise when using describe() — Aspirant

Giorgos Myrianthous Giorgos Myrianthous · Accepted Answer · 2019-05-01T09:32:52

You can use df.na().drop() in order to discard any rows containing NaN or NULL values in particular columns. For example,

df.na.drop(subset=["col1"])

will drop all the rows where col1 is NaN or NULL. Finally, you can now describe() your filtered dataframe:

filtered_df = df.na.drop(subset=["col1"])
filtered_df.describe()

Apply describe with filter on column specific in pyspark

1 Answers