I have a table in hive hive_tbl with columns 'col_1','col_2','col_3'. I have created a dataframe on top of above data.
Now i am getting statistics of the columns specified using describe() i am getting the result as below.
+-------+------------------+------------------+------------------+
|summary| col1 | col2 | col3 |
+-------+------------------+------------------+------------------+
| count| 17547479| 17547479| 17547479|
| mean|2.0946498354549963| 1.474746257282603|1943.9881619448768|
| stddev|1.7921560893864912|1.2898177241581452| 40126.73218327477|
| min| 0.0| 0.0| 0.0|
| max| 99.0| 60.0| 1.6240624E8|
+-------+------------------+------------------+------------------+
The above count is giving the count of the records in the entire table.
But can we apply filter specific to columns when using describe i.e. i have some blanks/values to be ignored when getting the count for some column eg col_1 has record count with good values 549023.
Can we get below result.
+-------+------------------+------------------+------------------+
|summary| col1 | col2 | col3 |
+-------+------------------+------------------+------------------+
| count| 549023 | 854049| 17547479|
| mean|2.0946498354549963| 1.474746257282603|1943.9881619448768|
| stddev|1.7921560893864912|1.2898177241581452| 40126.73218327477|
| min| 0.0| 0.0| 0.0|
| max| 99.0| 60.0| 1.6240624E8|
+-------+------------------+------------------+------------------+
describe(). So it seems yourDataFramedoes not contain Nulls? - Floriancol_1='NAME'and get the statistics for that column likewise when usingdescribe()- Aspirantdf.filter(df['col1'].isNotNull()).describe()- Sailesh Kotha