My data is like this:
df = pd.DataFrame({'ym': [202012, 202009, 202012, 202101, 202103, 202104, 202012, 202101, 202102, 202103, 202105, 202009, 202012, 202101, 202011, 202104], 'user_id': ['049', '009', '009', '009', '009', '009', '130', '130', '130', '130', '130', '166', '166', '166', '282', '282']})
df
ym user_id
0 202012 049
1 202009 009
2 202012 009
3 202101 009
4 202103 009
5 202104 009
6 202012 130
7 202101 130
8 202102 130
9 202103 130
10 202105 130
11 202009 166
12 202012 166
13 202101 166
14 202011 282
15 202104 282
I want to count apearances on year-month leave for each user, and I have tried this nasty way to get that:
#get a shift column of current year month column
df['ym_s'] = df.sort_values(['user_id','ym']).groupby('user_id')['ym'].shift(-1)
df = df.sort_values(['user_id','ym'])
#get diff value of year month
df['cntn_vl'] = df['ym_s']- df['ym']
#pick out the consecutive part of in my data: normal consecutive month = 1 or dec-jan value
df['rank'] = np.where(df['cntn_vl'].isin([1,89]), 1, 0)
#finally aggregate the consecutive appearance of each user
df_1 = df.groupby(by =['user_id'],as_index=False)['rank'].sum()
df_1
user_id rank
0 009 2
1 049 0
2 130 3
3 166 1
4 282 0
My question is, is there a better/fast(in my billions rows data, it's so hard to run) way to solve it.