Changing Shape and Structure of Data

Question

I have the following dataset

Office	Employee ID	Joining Date	Attrition Date
AA	700237	27-11-2017
AA	700238	11-01-2018
AA	700252	14-02-2018	08-04-2018
AB	700287	18-01-2014
AB	700449	28-02-2014	17-04-2014

The idea is to add in Active Column if somebody joins and deduct if somebody resigns in any month and therefore want to change it in the following format using python

Office	Month & Year	Active
AA	Jan-17	0
AA	Feb-17	0
AA	Mar-17	0
AA	Apr-17	0
AA	May-17	0
AA	Jun-17	0
AA	Jul-17	0
AA	Aug-17	0
AA	Sep-17	0
AA	Oct-17	0
AA	Nov-17	1
AA	Dec-17	1
AA	Jan-18	2
AA	Feb-18	3
AA	Mar-18	3
AA	Apr-18	2
AB	Jan-14	1
AB	Feb-14	2
AB	Mar-14	2
AB	Apr-14	1

Please help.

populating data from starting of the year of the first joining date and going till last instant of joining/attrition for each office..... — user9855045
make sense. there are only 28 days in february -> this is wrong 31-02-2014 the last line in your dataframe — sammywemmy

jezrael jezrael · Accepted Answer · 2021-03-08T07:22:31

Use:

#convert columns to datetimes
df['Joining Date'] = pd.to_datetime(df['Joining Date'], dayfirst=True)
df['Attrition Date'] = pd.to_datetime(df['Attrition Date'], dayfirst=True)

#add new rows by first January of minimal year per groups
df1 = df.groupby('Office')['Joining Date'].min() - pd.offsets.DateOffset(month=1, day=1)
df = df.append(df1.reset_index()).sort_values(['Office','Joining Date'])


#replace missing values in Attrition Date by maximal date with next month
#replace missing values in Joining  Date by maximal date with next month
next_month = (df.groupby('Office')['Attrition Date'].transform('max') + 
               pd.offsets.DateOffset(months=1))
next_month1 = (df.groupby('Office')['Joining Date'].transform('max') + 
               pd.offsets.DateOffset(months=1))

df['Attrition Date'] = df['Attrition Date'].fillna(next_month).fillna(next_month1)


#explode start and end datetimes converted to months with years
f = lambda x: pd.date_range(x['Joining Date'], 
                            x['Attrition Date'], freq='M').strftime('%b-%y')
df['Month & Year'] = df.apply(f, axis=1)

#count number of Employee ID with omit missing values
df = (df.explode('Month & Year')
        .groupby(['Office','Month & Year'], sort=False)['Employee ID']
        .count()
        .reset_index(name='Active'))

print (df)
   Office Month & Year  Active
0      AA       Jan-17       0
1      AA       Feb-17       0
2      AA       Mar-17       0
3      AA       Apr-17       0
4      AA       May-17       0
5      AA       Jun-17       0
6      AA       Jul-17       0
7      AA       Aug-17       0
8      AA       Sep-17       0
9      AA       Oct-17       0
10     AA       Nov-17       1
11     AA       Dec-17       1
12     AA       Jan-18       2
13     AA       Feb-18       3
14     AA       Mar-18       3
15     AA       Apr-18       2
16     AB       Jan-14       1
17     AB       Feb-14       2
18     AB       Mar-14       2
19     AB       Apr-14       1

Changing Shape and Structure of Data

1 Answers