I have a table with ~5k columns and ~1M rows that looks like this:
ID | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10 | Col11 |
---|---|---|---|---|---|---|---|---|---|---|---|
ID1 | 0 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 2 | 2 | 0 |
ID2 | 1 | 0 | 1 | 0 | 1 | 2 | 0 | 0 | 0 | 2 | 1 |
ID3 | 1 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | 1 | 1 |
ID4 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
ID5 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
I want to select columns whose names come from different lists and subset the rows according to a different criterion per list. For example, if `list1` has col1, col3, col4, col11 and `list2` has col2, col6, col9, col10, I want to keep rows where every column in `list1` equals 0 AND every column in `list2` equals 1, e.g.:

```python
df1 = df.filter((df.col1 == 0) & (df.col3 == 0) & (df.col4 == 0) & (df.col11 == 0) &
                (df.col2 == 1) & (df.col6 == 1) & (df.col9 == 1) & (df.col10 == 1))
```

Instead of writing out each column name by hand, I want these columns to be taken from the two lists. How can I achieve this using PySpark?
Comment (Christophe): Does `list1 == 0` mean that all values for all selected columns are 0? Anyway, it looks quite simple: select the columns, then filter on the desired values on all or part of them... or did I miss something?

Comment (user1388113): Yes, e.g. `df1 = df.filter((df.col1 == 0) & (df.col3 == 0) & ...)`. Instead of adding each column name by hand, I want these columns to be selected from the two different lists.