0
votes

For the problem, I want to calculate the mean for a particular member with a time window of 3 years. For example for the following data frame:

member_id Feature 1 Feature 2 Feature 3 Date
1 0.1 0.5 0.2 1/2/20
1 0.2 0.3 0.3 1/2/18
1 0.3 0.2 0.2 1/2/16
1 0.1 0.2 0.1 1/4/17
2 0.4 0.1 0.4 1/2/18
2 0.5 0.1 0.2 1/2/15

Taking into consideration of member_id = 1, there are different dates of admissions but we would want to consider only those whose time-window lies in the 3-year gap.

So, for the first row, the member_id = 1 with the date 1/2/20 would have the window from 1/2/20 - 1/2/17. Thus, while taking the aggregate value for the feature, rows 1,2, and 4 would be selected excluding the 3rd row where the date is 1/2/16.

I want the first row values to be like:

member_id Feature 1_mean Feature 2_mean Feature 3_mean Date
1 0.13 0.33 0.2 1/2/20
1 0.2 0.3 0.3 1/2/18
1 0.3 0.2 0.2 1/2/16
1 0.1 0.2 0.1 1/4/17
2 0.4 0.1 0.4 1/2/18
2 0.5 0.1 0.2 1/2/15

Now, similarly the same for row values 2 and 3 and so forth. The best way I thought would be to, first group by the members and then sort the data. After that iterate through each row calculating the mean.

I had written a sample code but it grouped it by the date causing multiple additional rows.

df['date'] = pd.to_datetime(df['date'])
results = dict()
for start in df['date']:
    end = start.date() - relativedelta(years=3)
    screen = (df.date <= start) & (df.date >= end)  # screen for 3 years of data
    singles = df.loc[screen, ['Feature 1', 'Feature 2', 'Feature 3' , 'member_id']]  # screen for 3 year window member_id by groups
    x = singles.groupby('member_id').mean()
    results[start] = x
results = pd.concat(results)

What is the best way to achieve the result?

1

1 Answers

0
votes

Proceed as follows:

  1. Define a function to get mean features for particular row from the current group, for the required period:

    def getRowMeans(row, grp):
        dTo = row.Date
        return grp[grp.Date.between(dTo - pd.DateOffset(years=3), dTo)]\
            .loc[:, 'Feature 1' : 'Feature 3'].mean()
    

    The idea is:

    • the current object is a row, from a group of rows (for some member_id),
    • for the operations below the whole group must also be known, so it is another parameter of this function (grp),
    • from grp take rows for 3 years preceding from the Date from the current row (including this date),
    • from these rows take all 3 Feature columns and return their means.
  2. Define a function, to be called for each group of rows (for some member_id), returning a copy of this group with all Feature columns replaced with their means (generated by getRowMeans):

    def FeaturesToMeans(grp):
        means = grp.apply(getRowMeans, axis=1, grp=grp)
        rv = grp.copy()
        rv.update(means)
        return rv
    

    The first step is to compute feature means.

    In order not to alter the original group, the object to be finally returned (rv) must be created as a copy of the original group.

    Then it is updated with the just computed means. Note however that update operates in place and does not return any result.

    The returned object is the updated group.

  3. Generate the actual result, as a new DataFrame:

    result = df.groupby('member_id', group_keys=False).apply(FeaturesToMeans)
    

The result, for your sample data, is:

   member_id  Feature 1  Feature 2  Feature 3       Date
0          1   0.133333   0.333333       0.20 2020-01-02
1          1   0.200000   0.233333       0.20 2018-01-02
2          1   0.300000   0.200000       0.20 2016-01-02
3          1   0.200000   0.200000       0.15 2017-01-04
4          2   0.450000   0.100000       0.30 2018-01-02
5          2   0.500000   0.100000       0.20 2015-01-02