I have a very large dataframe in PySpark: over 10 million rows and over 30 columns.
What is the most efficient way to search the entire dataframe for a given list of values and remove every row that contains any of them?
The given list of values: list = ['1097192', '10727550', '1098754']

The dataframe (df) is:

    +---------+--------------+---------------+---------+------------+
    | id      | first_name   | last_name     | Salary  | Verifycode |
    +---------+--------------+---------------+---------+------------+
    | 1986    | Rollie       | Lewin         | 1097192 | 42254172   |  -Remove Row
    | 289743  | Karil        | Sudron        | 2785190 | 3703538    |
    | 3864    | Massimiliano | Dallicott     | 1194553 | 23292573   |
    | 49074   | Gerry        | Grinnov       | 1506584 | 62291161   |
    | 5087654 | Nat          | Leatherborrow | 1781870 | 55183252   |
    | 689     | Thaine       | Tipple        | 2150105 | 40583249   |
    | 7907    | Myrlene      | Croley        | 2883250 | 70380540   |
    | 887     | Nada         | Redier        | 2676139 | 10727550   |  -Remove Row
    | 96533   | Sonny        | Bosden        | 1050067 | 13110714   |
    | 1098754 | Dennie       | McGahy        | 1804487 | 927935     |  -Remove Row
    +---------+--------------+---------------+---------+------------+
If it were a smaller dataframe, I could use collect() or toLocalIterator(), iterate over the rows, and drop the ones whose values appear in the list.
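For context, the row-by-row version I have in mind for a small dataframe would look roughly like this sketch (it assumes a SparkSession named spark and the df and list shown above; toLocalIterator() would slot in the same way):

    # Rough sketch of the collect()-based approach (small dataframes only:
    # it pulls every row onto the driver).
    kept_rows = []
    for row in df.collect():  # or df.toLocalIterator()
        # keep the row only if none of its fields match a value in the list
        if not any(str(v) in list for v in row):
            kept_rows.append(row)

    cleaned_df = spark.createDataFrame(kept_rows, schema=df.schema)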
Since it is a very large dataframe, what is the best way to solve this?
I have come up with the solution below, but is there a better way?
    from pyspark.sql.functions import col

    column_names = df.schema.names
    for name in column_names:
        # drop any row whose value in this column appears in the list
        df = df.filter(~col(name).isin(list))
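For reference, the same per-column check can also be written as one combined condition and a single filter call. This is only a sketch under the assumption that every column can be compared against the string values in the list; values_to_remove is just the list above renamed so it does not shadow the Python builtin, and Catalyst may collapse the chained filters above into the same plan anyway:

    from functools import reduce
    from pyspark.sql.functions import col, lit

    values_to_remove = ['1097192', '10727550', '1098754']

    # Build one boolean column that is True when ANY column's value is in the list...
    match_any = reduce(
        lambda acc, name: acc | col(name).isin(values_to_remove),
        df.schema.names,
        lit(False),
    )

    # ...and keep only the rows where no column matched.
    df = df.filter(~match_any)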