
I'm analyzing a large dataset containing a variable number of observations per subject (ranging from 1 to 26 occurrences). Since I need to analyse the time between events, subjects with only one occurrence are non-informative.

Previously, while working in Stata, I would assign a variable (called e.g. total) using this Stata code:

by idnummer, sort: gen total=_N

This way every line/subject has a variable 'total' and I can eliminate all subjects with total=1.

I have been trying with agg functions and with size, but I end up with 'NaN' values...

PS: using the "similar questions" sidebar I found the answer to my own question:

df['total'] = df.groupby('idnummer')['sequence'].transform('max')
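For completeness, here is a minimal sketch of the same idea using transform('size'), which counts rows per group directly and so does not depend on a sequence column (the data below is invented for illustration):

```python
import pandas as pd

# Invented sample data: subject 2 has only a single occurrence
df = pd.DataFrame({'idnummer': [1, 1, 2, 3, 3, 3],
                   'sequence': [1, 2, 1, 1, 2, 3]})

# 'size' counts the rows in each group, mirroring Stata's _N
df['total'] = df.groupby('idnummer')['idnummer'].transform('size')

# Drop subjects that occur only once
filtered = df[df['total'] > 1]
```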


2 Answers


First and foremost, your question is confusing. Consider editing it to make it clear.

Second, IIUC, you want to eliminate rows containing values within a column that only appear in that column once.

Setup
Consider the sample data in the dataframe df

import pandas as pd
import numpy as np
from string import ascii_uppercase

np.random.seed([3,1415])
df = pd.DataFrame(dict(mycol=np.random.choice(list(ascii_uppercase), 50)))

Series.value_counts
We can use the frequency of each element of column mycol in both of the solutions below.

vc = df.mycol.value_counts()
vc

N    5
H    4
X    4
W    4
L    3
M    3
A    3
T    3
F    2
Z    2
E    2
S    2
C    2
D    2
Y    2
U    2
Q    1
G    1
K    1
P    1
I    1
Name: mycol, dtype: int64

Option 1
pd.value_counts and map

We can see that ['Q', 'G', 'K', 'P', 'I'] all occur exactly once. Use map to convert each value of mycol to its count, then filter.

df[df.mycol.map(vc) > 1]

Option 2
np.bincount and np.unique

f = np.unique(df.mycol.values, return_inverse=True)[1]
df[np.bincount(f)[f] > 1]
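To unpack the trick: return_inverse gives each row the integer code of its value among the sorted unique values, bincount tallies each code, and indexing the tallies with f broadcasts each row's count back into row order. A tiny invented example:

```python
import numpy as np

vals = np.array(['a', 'b', 'a', 'c', 'a'])

# f[i] is the integer code of vals[i] among the sorted unique values
f = np.unique(vals, return_inverse=True)[1]  # [0, 1, 0, 2, 0]
counts = np.bincount(f)                      # [3, 1, 1]
per_row = counts[f]                          # [3, 1, 3, 1, 3]
mask = per_row > 1                           # keeps the three 'a' rows
```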

You don't actually need groupby for this; it's a little simpler to count the occurrences of each string directly (note that str.count interprets its argument as a regular expression, so this is only safe for plain alphanumeric IDs):

df['total'] = df.idnumber.apply(lambda x: df.idnumber.str.count(x).sum())

Or alternatively you can map the value counts like this:

df['total'] = df.idnumber.map(df.idnumber.value_counts())
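Put together, the map approach looks like this on a small invented sample (column name idnumber assumed, as above):

```python
import pandas as pd

# Invented sample: subject 'B' appears only once
df = pd.DataFrame({'idnumber': ['A', 'A', 'B', 'C', 'C', 'C']})

# Map each id to how many times it appears in the column
df['total'] = df['idnumber'].map(df['idnumber'].value_counts())

# Keep only subjects seen more than once
result = df[df['total'] > 1]
```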