1
votes

I have the following dataset

Office Employee ID Joining Date Attrition Date
AA 700237 27-11-2017
AA 700238 11-01-2018
AA 700252 14-02-2018 08-04-2018
AB 700287 18-01-2014
AB 700449 28-02-2014 17-04-2014

The idea is to add in Active Column if somebody joins and deduct if somebody resigns in any month and therefore want to change it in the following format using python

Office Month & Year Active
AA Jan-17 0
AA Feb-17 0
AA Mar-17 0
AA Apr-17 0
AA May-17 0
AA Jun-17 0
AA Jul-17 0
AA Aug-17 0
AA Sep-17 0
AA Oct-17 0
AA Nov-17 1
AA Dec-17 1
AA Jan-18 2
AA Feb-18 3
AA Mar-18 3
AA Apr-18 2
AB Jan-14 1
AB Feb-14 2
AB Mar-14 2
AB Apr-14 1

Please help.

1
populating data from starting of the year of the first joining date and going till last instant of joining/attrition for each office.....user9855045
make sense. there are only 28 days in february -> this is wrong 31-02-2014 the last line in your dataframesammywemmy

1 Answers

2
votes

Use:

#convert columns to datetimes
df['Joining Date'] = pd.to_datetime(df['Joining Date'], dayfirst=True)
df['Attrition Date'] = pd.to_datetime(df['Attrition Date'], dayfirst=True)

#add new rows by first January of minimal year per groups
df1 = df.groupby('Office')['Joining Date'].min() - pd.offsets.DateOffset(month=1, day=1)
df = df.append(df1.reset_index()).sort_values(['Office','Joining Date'])


#replace missing values in Attrition Date by maximal date with next month
#replace missing values in Joining  Date by maximal date with next month
next_month = (df.groupby('Office')['Attrition Date'].transform('max') + 
               pd.offsets.DateOffset(months=1))
next_month1 = (df.groupby('Office')['Joining Date'].transform('max') + 
               pd.offsets.DateOffset(months=1))

df['Attrition Date'] = df['Attrition Date'].fillna(next_month).fillna(next_month1)


#explode start and end datetimes converted to months with years
f = lambda x: pd.date_range(x['Joining Date'], 
                            x['Attrition Date'], freq='M').strftime('%b-%y')
df['Month & Year'] = df.apply(f, axis=1)

#count number of Employee ID with omit missing values
df = (df.explode('Month & Year')
        .groupby(['Office','Month & Year'], sort=False)['Employee ID']
        .count()
        .reset_index(name='Active'))

print (df)
   Office Month & Year  Active
0      AA       Jan-17       0
1      AA       Feb-17       0
2      AA       Mar-17       0
3      AA       Apr-17       0
4      AA       May-17       0
5      AA       Jun-17       0
6      AA       Jul-17       0
7      AA       Aug-17       0
8      AA       Sep-17       0
9      AA       Oct-17       0
10     AA       Nov-17       1
11     AA       Dec-17       1
12     AA       Jan-18       2
13     AA       Feb-18       3
14     AA       Mar-18       3
15     AA       Apr-18       2
16     AB       Jan-14       1
17     AB       Feb-14       2
18     AB       Mar-14       2
19     AB       Apr-14       1