
I'm analyzing a large dataset containing a variable number of observations per subject (ranging from 1 to 26 occurrences). Since I need to analyse the time between events, subjects with only one occurrence are non-informative.

Previously, while working in Stata, I would assign a variable (called e.g. total) using this Stata code:

by idnummer, sort: gen total=_N

This way every line/subject has a variable 'total' and I can eliminate all subjects with total=1.

I have been trying with agg functions and with size, but I end up with 'NaN' values...

PS: using the "similar questions" sidebar I found the answer to my own question:

df['total'] = df.groupby('idnummer')['sequence'].transform('max')
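For completeness, here is a minimal sketch of the same idea using transform('size'), which counts rows per group directly and so does not depend on a sequence column (the data below is invented for illustration):

```python
import pandas as pd

# Invented sample data: subject 2 has only a single occurrence
df = pd.DataFrame({'idnummer': [1, 1, 2, 3, 3, 3],
                   'sequence': [1, 2, 1, 1, 2, 3]})

# 'size' counts the rows in each group, mirroring Stata's _N
df['total'] = df.groupby('idnummer')['idnummer'].transform('size')

# Drop subjects that occur only once
filtered = df[df['total'] > 1]
```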


2 Answers


First and foremost, your question is confusing. Consider editing it to make it clear.

Second, IIUC, you want to eliminate rows containing values within a column that only appear in that column once.

Setup
Consider the sample data in the dataframe df

import pandas as pd
import numpy as np
from string import ascii_uppercase

np.random.seed([3,1415])
df = pd.DataFrame(dict(mycol=np.random.choice(list(ascii_uppercase), 50)))

Series.value_counts
We can use the frequency of each element of column mycol in both of the solutions below.

vc = df.mycol.value_counts()
vc

N    5
H    4
X    4
W    4
L    3
M    3
A    3
T    3
F    2
Z    2
E    2
S    2
C    2
D    2
Y    2
U    2
Q    1
G    1
K    1
P    1
I    1
Name: mycol, dtype: int64

Option 1
pd.value_counts and map

We can see that ['Q', 'G', 'K', 'P', 'I'] all occur exactly once. Use map to convert each value of mycol to its count, then filter.

df[df.mycol.map(vc) > 1]

Option 2
np.bincount and np.unique

f = np.unique(df.mycol.values, return_inverse=True)[1]
df[np.bincount(f)[f] > 1]
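To unpack the trick: return_inverse gives each row the integer code of its value among the sorted unique values, bincount tallies each code, and indexing the tallies with f broadcasts each row's count back into row order. A tiny invented example:

```python
import numpy as np

vals = np.array(['a', 'b', 'a', 'c', 'a'])

# f[i] is the integer code of vals[i] among the sorted unique values
f = np.unique(vals, return_inverse=True)[1]  # [0, 1, 0, 2, 0]
counts = np.bincount(f)                      # [3, 1, 1]
per_row = counts[f]                          # [3, 1, 3, 1, 3]
mask = per_row > 1                           # keeps the three 'a' rows
```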

You don't actually need groupby for this; it's a little simpler to count the occurrences of each string directly (note that str.count interprets its argument as a regular expression, so this is only safe for plain alphanumeric IDs):

df['total'] = df.idnumber.apply(lambda x: df.idnumber.str.count(x).sum())

Or alternatively you can map the value counts like this:

df['total'] = df.idnumber.map(df.idnumber.value_counts())
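Put together, the map approach looks like this on a small invented sample (column name idnumber assumed, as above):

```python
import pandas as pd

# Invented sample: subject 'B' appears only once
df = pd.DataFrame({'idnumber': ['A', 'A', 'B', 'C', 'C', 'C']})

# Map each id to how many times it appears in the column
df['total'] = df['idnumber'].map(df['idnumber'].value_counts())

# Keep only subjects seen more than once
result = df[df['total'] > 1]
```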