2
votes

I have a large list which includes duplicate values and I wish to subset a data frame using the list values. Usually I would use the .isin method, but I want to keep duplicate rows. Here is some example code:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 'car'], [4, 5, 'bike'], [1, 2, 'train'], [1, 2, 'car'], [1, 2, 'train']]), columns=['a', 'b', 'c'])

lst = ['car', 'bike', 'car', 'car']

So I want to return a data frame that contains the matching rows once for every occurrence in the list: each time an item occurs in the list, the corresponding rows should be returned again.

On a small dataset such as the above I can loop through the list and append the returned values to a new data frame, but on a large dataset this takes an extremely long time. Any suggestions?

EDIT: So Chris' suggestion works, and provides the expected output using:

pd.concat([df[df['c'].eq(x)] for x in lst])

However, like an explicit loop, this is extremely slow compared to something like the .isin method when working with much larger data. I added this edit so that the expected output can be reproduced.

3
pd.concat([df[df['c'].eq(x)] for x in lst]) - is this what you mean? – Chris A
What is the expected output? – Dani Mesejo

3 Answers

1
votes

The first step is to filter only the matching values:

df = df[df['c'].isin(lst)]

Then flatten the index values that match the condition and use loc to repeat rows. np.repeat would also be possible, but it converts all columns to strings, so it cannot be used here:

idx = [y for x in lst for y in df.index[df['c'].values == x]]
df_new = df.loc[idx].reset_index(drop=True)
print (df_new)
   a  b     c
0  1  2   car
1  1  2   car
2  4  5  bike
3  1  2   car
4  1  2   car
5  1  2   car
6  1  2   car

If there are many repeated values, it is possible to filter only once per distinct value and then repeat the index values:

from itertools import chain

ser = pd.Series(lst)
idx = ser.map({k: df.index[df['c'].values == k] for k in ser.value_counts().index})
df_new = df.loc[list(chain.from_iterable(idx))].reset_index(drop=True)
print (df_new)
   a  b     c
0  1  2   car
1  1  2   car
2  4  5  bike
3  1  2   car
4  1  2   car
5  1  2   car
6  1  2   car

Another solution:

from itertools import chain
from collections import Counter

d = {k: df.index[df['c'].values == k] for k in Counter(lst)}
idx = [y for x in lst for y in d[x]]
df_new = df.loc[idx].reset_index(drop=True)
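Assembled into a self-contained sketch (using the question's sample df and lst), the last variant reads:

```python
import numpy as np
import pandas as pd
from collections import Counter

# sample data from the question
df = pd.DataFrame(np.array([[1, 2, 'car'], [4, 5, 'bike'], [1, 2, 'train'],
                            [1, 2, 'car'], [1, 2, 'train']]),
                  columns=['a', 'b', 'c'])
lst = ['car', 'bike', 'car', 'car']

# compute the matching index once per distinct value...
d = {k: df.index[df['c'].values == k] for k in Counter(lst)}
# ...then expand it once per occurrence in the list
idx = [y for x in lst for y in d[x]]
df_new = df.loc[idx].reset_index(drop=True)
print(df_new)
```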
1
votes

IIUC, use pandas.concat with a list comprehension:

df_new = pd.concat([df[df['c'].eq(x)] for x in lst], ignore_index=True)

An alternative approach is to create a helper Series with the value_counts method on your list, and first reduce the original DataFrame's size by filtering with .isin:

s = pd.Series(lst).value_counts()
df = df[df['c'].isin(set(lst))]

idx = np.concatenate([df[df['c'].eq(i)].index.repeat(r) for i, r in s.items()])

df_new = df.loc[idx]
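Put together as a runnable sketch with the question's sample data:

```python
import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame(np.array([[1, 2, 'car'], [4, 5, 'bike'], [1, 2, 'train'],
                            [1, 2, 'car'], [1, 2, 'train']]),
                  columns=['a', 'b', 'c'])
lst = ['car', 'bike', 'car', 'car']

# count how often each list item occurs, filter df once,
# then repeat each matching row's index by its count
s = pd.Series(lst).value_counts()
df = df[df['c'].isin(set(lst))]
idx = np.concatenate([df[df['c'].eq(i)].index.repeat(r) for i, r in s.items()])
df_new = df.loc[idx]
print(df_new)
```

One caveat: Index.repeat repeats element-wise, so this groups all repeats of each value together (here six 'car' rows followed by one 'bike' row) rather than following the list order that the pd.concat version produces.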
0
votes

If I have correctly understood your question, groupby could help:

gr = df.groupby('c')
for i in lst:
    subset = gr.get_group(i)
    # process subset...
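To build the full result in list order, the per-item groups can be concatenated; a sketch using the question's sample data:

```python
import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame(np.array([[1, 2, 'car'], [4, 5, 'bike'], [1, 2, 'train'],
                            [1, 2, 'car'], [1, 2, 'train']]),
                  columns=['a', 'b', 'c'])
lst = ['car', 'bike', 'car', 'car']

# groupby indexes the groups once, so repeated get_group lookups are cheap
gr = df.groupby('c')
result = pd.concat([gr.get_group(x) for x in lst], ignore_index=True)
print(result)
```

Note that get_group raises KeyError for list values absent from df, so filter lst first if that can happen.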