0
votes

I have a large dataframe with 30+ columns.

The first two columns ("A" and "B") contain general infromation about a feature, while the rest of columns represent different experiments for that feature.

I want to slice my dataframe to only contain rows with respective columns, in which any value was below a threshold, and furthermore include columns A and B.

My current code looks like this :

df=df.loc[df.le(0.05).any(1),].reindex(colums=["A","B"]+list(df.columns[df.le(0.05).any(0)]))

While it works, this looks very inelegant. Is there a "nicer" way to go about?

edit: Example

A B C D E
1 10 0.9 0.8 0.04
1 20 0.7 0.6 0.5
2 5 0.1 0.01 0.3

and as result

A B D E
1 10 0.8 0.04
2 5 0.01 0.3
2

2 Answers

0
votes

Let's start with this input:

import numpy as np
np.random.seed(1)
df = pd.concat([pd.DataFrame({'A': ['feature']*5,
                              'B': ['feature']*5,}),
                pd.DataFrame({'col%s' % i: np.random.randint(0,5,5) for i in range(10)})
               ], axis=1)

input:

         A        B  col0  col1  col2  col3  col4
0  feature  feature     3     0     1     3     4
1  feature  feature     4     0     2     4     1
2  feature  feature     0     1     4     2     1
3  feature  feature     1     4     2     4     0
4  feature  feature     3     4     4     2     1

Now we select the rows where there is not at least one 0 (just an example, you can use any selector here):

df[df.drop(['A', 'B'], axis=1).le(0).any(axis=1)]

output:

         A        B  col0  col1  col2  col3  col4
0  feature  feature     3     0     1     3     4
1  feature  feature     4     0     2     4     1
2  feature  feature     0     1     4     2     1
3  feature  feature     1     4     2     4     0
0
votes

Try:

df=df.loc[df.le(0.05).any(1),['A','B',*df.columns[df.le(0.05).any(0)]]]
#OR
df=df.loc[df.le(0.05).any(1),(df.le(0.05).any(0)) | (df.columns.isin(['A','B']))]

output of df:

    A   B   D       E
0   1   10  0.80    0.04
2   2   5   0.01    0.30