
I have a table with ~5k columns and ~1M rows that looks like this:

ID    Col1  Col2  Col3  Col4  Col5  Col6  Col7  Col8  Col9  Col10  Col11
ID1   0     1     0     1     0     2     1     1     2     2      0
ID2   1     0     1     0     1     2     0     0     0     2      1
ID3   1     0     0     0     0     2     1     0     1     1      1
ID4   0     1     1     1     1     1     1     1     1     0      0
ID5   0     1     1     1     1     1     0     1     0     0      0

I want to select columns whose names come from different lists and filter the rows according to different criteria. For example, if list1 contains col1, col3, col4, col11 and list2 contains col2, col6, col9, col10, I want to keep only the rows where every list1 column equals 0 AND every list2 column equals 1, e.g. df1 = df.filter((df.col1 == 0) & (df.col3 == 0) & (df.col4 == 0) & (df.col11 == 0) & (df.col2 == 1) & (df.col6 == 1) & (df.col9 == 1) & (df.col10 == 1)). Instead of spelling out each column name, I want the conditions built from the two lists. How can I achieve this in PySpark?
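Something like the following minimal sketch is what I have in mind: fold the per-column conditions into one boolean expression with functools.reduce and operator.and_ (which is the same & operator used above). The column lists and the extra ID6 row are purely illustrative:

    from functools import reduce
    from operator import and_

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy DataFrame shaped like the table above, plus a hypothetical
    # ID6 row that actually satisfies the filter.
    cols = ["ID"] + [f"Col{i}" for i in range(1, 12)]
    df = spark.createDataFrame(
        [
            ("ID1", 0, 1, 0, 1, 0, 2, 1, 1, 2, 2, 0),
            ("ID6", 0, 1, 0, 0, 5, 1, 7, 8, 1, 1, 0),  # hypothetical matching row
        ],
        cols,
    )

    list1 = ["Col1", "Col3", "Col4", "Col11"]  # every one must equal 0
    list2 = ["Col2", "Col6", "Col9", "Col10"]  # every one must equal 1

    # Build one boolean Column per name, then AND them all together.
    conditions = [F.col(c) == 0 for c in list1] + [F.col(c) == 1 for c in list2]
    df1 = df.filter(reduce(and_, conditions))
    df1.show()  # keeps only ID6

reduce(and_, conditions) produces the same Column expression as chaining the == tests with & by hand, so the filter no longer depends on how many names each list holds.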

Does list1 == 0 mean that all values of all selected columns are 0? Anyway, it looks quite simple: select the columns, then filter on the desired values for all or some of them... or did I miss something? – Christophe
Yes, every matching column in list1 should have a value of 0 and every matching column in list2 should have a value of 1. I am confused about how to filter on two or more different criteria without spelling out each column name, as in the filter expression in the question; I want the columns to come from the two lists. – user1388113