I want to calculate the mean of each feature for a particular member over a 3-year time window. For example, given the following data frame:
member_id | Feature 1 | Feature 2 | Feature 3 | Date |
---|---|---|---|---|
1 | 0.1 | 0.5 | 0.2 | 1/2/20 |
1 | 0.2 | 0.3 | 0.3 | 1/2/18 |
1 | 0.3 | 0.2 | 0.2 | 1/2/16 |
1 | 0.1 | 0.2 | 0.1 | 1/4/17 |
2 | 0.4 | 0.1 | 0.4 | 1/2/18 |
2 | 0.5 | 0.1 | 0.2 | 1/2/15 |
Take member_id = 1: there are several admission dates, but for each row I only want to consider the records that fall within the 3 years leading up to that row's date.
So for the first row (member_id = 1, date 1/2/20) the window runs from 1/2/17 to 1/2/20. When aggregating the features, rows 1, 2, and 4 would be selected, and row 3 (date 1/2/16) would be excluded.
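To make the window concrete, here is a minimal sketch of the first-row calculation only. It assumes the dates are month/day/year, that both ends of the window are inclusive, and it uses hypothetical names (`m1`, `window`, `first_row_mean`) that are not part of my real code.

```python
import pandas as pd

# Rows for member_id = 1 from the table above.
# Assumption: dates are month/day/year with a two-digit year.
m1 = pd.DataFrame({
    "Feature 1": [0.1, 0.2, 0.3, 0.1],
    "Feature 2": [0.5, 0.3, 0.2, 0.2],
    "Feature 3": [0.2, 0.3, 0.2, 0.1],
    "Date": pd.to_datetime(
        ["1/2/20", "1/2/18", "1/2/16", "1/4/17"], format="%m/%d/%y"
    ),
})

end = m1.loc[0, "Date"]                # date of the first row
start = end - pd.DateOffset(years=3)   # three calendar years earlier

# Keep only rows whose date lies in [start, end], both ends inclusive.
window = m1[m1["Date"].between(start, end)]
first_row_mean = window[["Feature 1", "Feature 2", "Feature 3"]].mean()
```

Here `window` keeps the rows dated 1/2/20, 1/2/18, and 1/4/17, and drops the one dated 1/2/16.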
I want the first row's values to look like this:
member_id | Feature 1_mean | Feature 2_mean | Feature 3_mean | Date |
---|---|---|---|---|
1 | 0.13 | 0.33 | 0.2 | 1/2/20 |
1 | 0.2 | 0.3 | 0.3 | 1/2/18 |
1 | 0.3 | 0.2 | 0.2 | 1/2/16 |
1 | 0.1 | 0.2 | 0.1 | 1/4/17 |
2 | 0.4 | 0.1 | 0.4 | 1/2/18 |
2 | 0.5 | 0.1 | 0.2 | 1/2/15 |
The same should then be done for rows 2, 3, and so on. The best approach I could think of is to first group by member, sort the data by date, and then iterate through each row, calculating the mean over its window.
I wrote some sample code, but it groups by date, which produces many additional rows:
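The idea described above (group by member, sort, then compute a mean per row) could be sketched roughly as follows. This is only an illustration under assumptions: the dates are read as month/day/year, the window is inclusive at both ends, and `windowed_means`, `means`, and `result` are hypothetical names. It loops over rows in Python, so it is not tuned for large data.

```python
import pandas as pd

df = pd.DataFrame({
    "member_id": [1, 1, 1, 1, 2, 2],
    "Feature 1": [0.1, 0.2, 0.3, 0.1, 0.4, 0.5],
    "Feature 2": [0.5, 0.3, 0.2, 0.2, 0.1, 0.1],
    "Feature 3": [0.2, 0.3, 0.2, 0.1, 0.4, 0.2],
    "Date": ["1/2/20", "1/2/18", "1/2/16", "1/4/17", "1/2/18", "1/2/15"],
})
# Assumption: dates are month/day/year with a two-digit year.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%y")

features = ["Feature 1", "Feature 2", "Feature 3"]


def windowed_means(group, years=3):
    """For each row of one member, average the features over that
    member's rows dated within [Date - years, Date]."""
    group = group.sort_values("Date")
    rows = []
    for idx, end in group["Date"].items():
        start = end - pd.DateOffset(years=years)
        in_window = group["Date"].between(start, end)
        # Name each result after the original row label so it can be joined back.
        rows.append(group.loc[in_window, features].mean().rename(idx))
    return pd.DataFrame(rows)


# One block of per-row means per member, stacked back together.
means = pd.concat([windowed_means(g) for _, g in df.groupby("member_id")])

# Align on the original index so the output has exactly one row per input row.
result = df[["member_id", "Date"]].join(means.add_suffix("_mean"))
```

Because the means are joined back on the original index, `result` has the same number of rows as `df`, with columns named like `Feature 1_mean`.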

    import pandas as pd
    from dateutil.relativedelta import relativedelta

    df['date'] = pd.to_datetime(df['date'])
    results = dict()
    for start in df['date']:
        end = start - relativedelta(years=3)  # beginning of the 3-year look-back window
        screen = (df.date <= start) & (df.date >= end)  # rows of ALL members in that window
        singles = df.loc[screen, ['Feature 1', 'Feature 2', 'Feature 3', 'member_id']]
        x = singles.groupby('member_id').mean()  # one mean per member, repeated for every date
        results[start] = x
    results = pd.concat(results)

What is the best way to achieve the result?