Pandas drop rare entries

Question

I'm new to Pandas. To simplify, I have a data frame with two columns: product_id and rating. Each entry is a new review for the given product. I want to get a new data frame in which the rows for any product that received fewer than 20 reviews (i.e. appears fewer than 20 times in the original data frame) are removed. I can count the number of occurrences with:

a = data.groupby('product_id').count()
b = a.loc[a['rating'] >= 20]

but that gives me back a data frame with a single column of counts. When displayed, each product_id has its count, but I'm unable to access the actual product_id values to use them to filter the original table. For instance,

b.values

gives back an array of the counts, but not the product_ids.

EdChum · Accepted Answer · 2015-10-30T15:53:44

3 votes

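You want filter on the groupby object, which keeps all rows of every group that satisfies the condition. Use >= 20 so that only products with fewer than 20 reviews are dropped:

a = data.groupby('product_id').filter(lambda x: len(x) >= 20)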
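For illustration, here is a small self-contained sketch of that approach, with two alternatives that should give the same rows. The toy data and the min_reviews threshold of 2 are made up for the example (standing in for the 20 in the question). The third variant also shows where the product_ids from the question's count result can be found: they are in the index of the grouped frame, not in .values.

```python
import pandas as pd

# Hypothetical toy data: product "A" has 3 reviews, "B" has 1, "C" has 2.
data = pd.DataFrame({
    "product_id": ["A", "A", "A", "B", "C", "C"],
    "rating": [5, 4, 3, 2, 1, 5],
})
min_reviews = 2  # assumed threshold for the demo; the question uses 20

# 1) groupby + filter: keep every row of each group that passes the test.
kept_filter = data.groupby("product_id").filter(lambda g: len(g) >= min_reviews)

# 2) transform("size"): attach the group size to each row, then use a boolean mask.
sizes = data.groupby("product_id")["rating"].transform("size")
kept_transform = data[sizes >= min_reviews]

# 3) The ids are stored in the index of the count result, so they can feed isin().
counts = data.groupby("product_id").count()
frequent_ids = counts.index[counts["rating"] >= min_reviews]
kept_isin = data[data["product_id"].isin(frequent_ids)]

print(kept_filter)
```

On large frames the mask-based variants (2 and 3) are often faster than filter with a Python lambda, since the lambda is called once per group, but that is worth measuring on your own data.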
