2
votes

I have a large list which includes duplicate values and I wish to subset a data frame using the list values. Usually I would use the .isin method, but I want to keep duplicate rows. Here is some example code:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 'car'], [4, 5, 'bike'], [1, 2, 'train'], [1, 2, 'car'], [1, 2, 'train']]), columns=['a', 'b', 'c'])

lst = ['car', 'bike', 'car', 'car']

So I want to return a data frame that contains the matching rows once for every occurrence in the list: each time an item occurs in the list, the corresponding rows should be returned again.

On a small dataset such as the above I can loop through the list and append the returned values to a new data frame, but on a large dataset this takes an extremely long time. Any suggestions?

EDIT: So Chris' suggestion works, and provides the expected output using:

pd.concat([df[df['c'].eq(x)] for x in lst])

However, like an explicit loop, this is extremely slow compared to something like the .isin method when working with much larger data. I added this edit so that the expected output can be reproduced.

3
pd.concat([df[df['c'].eq(x)] for x in lst]) - is this what you mean? – Chris A
What is the expected output? – Dani Mesejo

3 Answers

1
votes

The first step is to filter only the matching values:

df = df[df['c'].isin(lst)]

Then flatten the index values that match the condition and use loc to repeat rows. np.repeat would also be possible, but it converts all columns to strings, so it cannot be used here:

idx = [y for x in lst for y in df.index[df['c'].values == x]]
df_new = df.loc[idx].reset_index(drop=True)
print (df_new)
   a  b     c
0  1  2   car
1  1  2   car
2  4  5  bike
3  1  2   car
4  1  2   car
5  1  2   car
6  1  2   car

If there are many repeated values, it is possible to filter only once per distinct value and then repeat the index values:

from itertools import chain

ser = pd.Series(lst)
idx = ser.map({k: df.index[df['c'].values == k] for k in ser.value_counts().index})
df_new = df.loc[list(chain.from_iterable(idx))].reset_index(drop=True)
print (df_new)
   a  b     c
0  1  2   car
1  1  2   car
2  4  5  bike
3  1  2   car
4  1  2   car
5  1  2   car
6  1  2   car

Another solution:

from itertools import chain
from collections import Counter

d = {k: df.index[df['c'].values == k] for k in Counter(lst)}
idx = [y for x in lst for y in d[x]]
df_new = df.loc[idx].reset_index(drop=True)
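Assembled into a self-contained sketch (using the question's sample df and lst), the last variant reads:

```python
import numpy as np
import pandas as pd
from collections import Counter

# sample data from the question
df = pd.DataFrame(np.array([[1, 2, 'car'], [4, 5, 'bike'], [1, 2, 'train'],
                            [1, 2, 'car'], [1, 2, 'train']]),
                  columns=['a', 'b', 'c'])
lst = ['car', 'bike', 'car', 'car']

# compute the matching index once per distinct value...
d = {k: df.index[df['c'].values == k] for k in Counter(lst)}
# ...then expand it once per occurrence in the list
idx = [y for x in lst for y in d[x]]
df_new = df.loc[idx].reset_index(drop=True)
print(df_new)
```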
1
votes

IIUC, use pandas.concat with a list comprehension:

df_new = pd.concat([df[df['c'].eq(x)] for x in lst], ignore_index=True)

An alternative approach is to create a helper Series with the value_counts method on your list, and first reduce the original DataFrame's size by filtering with .isin:

s = pd.Series(lst).value_counts()
df = df[df['c'].isin(set(lst))]

idx = np.concatenate([df[df['c'].eq(i)].index.repeat(r) for i, r in s.items()])

df_new = df.loc[idx]
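Put together as a runnable sketch with the question's sample data:

```python
import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame(np.array([[1, 2, 'car'], [4, 5, 'bike'], [1, 2, 'train'],
                            [1, 2, 'car'], [1, 2, 'train']]),
                  columns=['a', 'b', 'c'])
lst = ['car', 'bike', 'car', 'car']

# count how often each list item occurs, filter df once,
# then repeat each matching row's index by its count
s = pd.Series(lst).value_counts()
df = df[df['c'].isin(set(lst))]
idx = np.concatenate([df[df['c'].eq(i)].index.repeat(r) for i, r in s.items()])
df_new = df.loc[idx]
print(df_new)
```

One caveat: Index.repeat repeats element-wise, so this groups all repeats of each value together (here six 'car' rows followed by one 'bike' row) rather than following the list order that the pd.concat version produces.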
0
votes

If I have correctly understood your question, groupby could help:

gr = df.groupby('c')
for i in lst:
    subset = gr.get_group(i)
    # process subset...
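To build the full result in list order, the per-item groups can be concatenated; a sketch using the question's sample data:

```python
import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame(np.array([[1, 2, 'car'], [4, 5, 'bike'], [1, 2, 'train'],
                            [1, 2, 'car'], [1, 2, 'train']]),
                  columns=['a', 'b', 'c'])
lst = ['car', 'bike', 'car', 'car']

# groupby indexes the groups once, so repeated get_group lookups are cheap
gr = df.groupby('c')
result = pd.concat([gr.get_group(x) for x in lst], ignore_index=True)
print(result)
```

Note that get_group raises KeyError for list values absent from df, so filter lst first if that can happen.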