3
votes

I have a dataframe which looks like this:

pd.DataFrame({'a': ['cust1', 'cust1', 'cust1', 'cust2', 'cust2', 'cust3', 'cust3', 'cust3', 'cust4', 'cust4'],
                   'year': [2017, 2018, 2019, 2018, 2019, 2017, 2018, 2019, 2018, 2019],
                   'cond': [True, True, False, True, True, True, True, True, True, True],
                   'startDate': [2017, 2017, 2017, 2018, 2018, 2017, 2017, 2017, 2017, 2017]})

endYear = 2019

       a  year   cond  startDate
0  cust1  2017   True       2017
1  cust1  2018   True       2017
2  cust1  2019  False       2017
3  cust2  2018   True       2018
4  cust2  2019   True       2018
5  cust3  2017   True       2017
6  cust3  2018   True       2017
7  cust3  2019   True       2017
8  cust4  2018   True       2017
9  cust4  2019   True       2017

For each group in column 'a' I need to check if column 'cond' has "True" value for all years between column 'startDate' (can be different for each group) and 'endYear'.

My resulting dataframe should look like this:

       a  final_score
0  cust1        False
1  cust2         True
2  cust3         True
3  cust4        False

Logic:

cust1 = False as year 2019 value = False

cust2 = True as startDate 2018 and for all years between 2018 and 2019 'cond' is True

cust3 = True as startDate 2017 and for all years between 2017 and 2019 'cond' is True

cust4 = False as startDate 2017 but no data for 2017 so condition not met

2

2 Answers

6
votes

IIUC, you could use reindex to fill up the empty years by group then check for True:

print (df.groupby("a").apply(lambda d: d.set_index("year").reindex(range(d["startDate"].min(), endYear+1))["cond"].eq(True).all()))

a
cust1    False
cust2     True
cust3     True
cust4    False
dtype: bool
0
votes

Try (Way less readable than Henry's answer - however no lambda has been used to deliver the solution, and it leverages vectorized processing, hence I would expect it to perform better, particularly for large datasets)

import numpy as np

#to filter non-True cond in the range'
df2=df.loc[(df['year'].ge(df['startDate'])&df['year'].le(endYear)&~df['cond']), 'a'].unique()

#to filter the missing values:
res=df.groupby('a', as_index=False).agg({'year': 'nunique', 'startDate': 'min'})

#pipe separates non-True condition and one for missing data
res['final_score']=np.where(np.isin(res['a'].to_numpy(), df2) | ((endYear-res['startDate']+1)-res['year']).abs()<1, True, False)
res.drop(['year', 'startDate'], inplace=True, axis=1)

Outputs:

       a  final_score
0  cust1        False
1  cust2         True
2  cust3         True
3  cust4        False