I have a dataframe df:
Serial_no date Index x y
1 2014-01-01 1 2.0 3.0
1 2014-03-01 2 3.0 3.0
1 2014-04-01 3 6.0 2.0
2 2011-03-01 1 5.1 1.3
2 2011-04-01 2 5.8 0.6
2 2011-05-01 3 6.5 -0.1
2 2011-07-01 4 3.0 5.0
3 2019-10-01 1 7.9 -1.5
3 2019-11-01 2 8.6 -2.2
3 2020-01-01 3 10.0 -3.6
3 2020-02-01 4 10.7 -4.3
3 2020-03-01 5 4.0 3.0
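For reproducibility, here is a minimal sketch that builds the sample above. It assumes the date column should be parsed as datetime and that x and y are floats; the values are copied from the table.

```python
import pandas as pd

# Sample data copied from the table above.
# Assumption: `date` is stored as datetime64, `x` and `y` as floats.
df = pd.DataFrame({
    "Serial_no": [1] * 3 + [2] * 4 + [3] * 5,
    "date": pd.to_datetime([
        "2014-01-01", "2014-03-01", "2014-04-01",
        "2011-03-01", "2011-04-01", "2011-05-01", "2011-07-01",
        "2019-10-01", "2019-11-01", "2020-01-01", "2020-02-01", "2020-03-01",
    ]),
    "Index": [1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5],
    "x": [2.0, 3.0, 6.0, 5.1, 5.8, 6.5, 3.0, 7.9, 8.6, 10.0, 10.7, 4.0],
    "y": [3.0, 3.0, 2.0, 1.3, 0.6, -0.1, 5.0, -1.5, -2.2, -3.6, -4.3, 3.0],
})

print(df)
```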
Notice:
The data is grouped by Serial_no, and date is reported monthly (on the first of every month).
The Index column is set so that each consecutive reported date gets a consecutive number within its group.
The number of reported dates differs between Serial_no groups.
The range of reported dates differs for each Serial_no group (they don't start or end on the same date).
The problem:
Some dates in the monthly series have no reported data; notice that each Serial_no group is missing at least one month.
I want to add a row to each group for every missing date, with the x and y columns set to NaN.
Example of the dataframe I need:
Serial_no date Index x y
1 2014-01-01 1 2.0 3.0
1 2014-02-01 2 NaN NaN
1 2014-03-01 3 3.0 3.0
1 2014-04-01 4 6.0 2.0
2 2011-03-01 1 5.1 1.3
2 2011-04-01 2 5.8 0.6
2 2011-05-01 3 6.5 -0.1
2 2011-06-01 4 NaN NaN
2 2011-07-01 5 3.0 5.0
3 2019-10-01 1 7.9 -1.5
3 2019-11-01 2 8.6 -2.2
3 2019-12-01 3 NaN NaN
3 2020-01-01 4 10.0 -3.6
3 2020-02-01 5 10.7 -4.3
3 2020-03-01 6 4.0 3.0
I know how to replace the blank cells with NaN once the rows with missing dates are inserted, using the following code:
import pandas as pd
import numpy as np
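# Assign the result back; replace(..., inplace=True) on a selected column may not update df in recent pandas versions
df['x'] = df['x'].replace('', np.nan)
df['y'] = df['y'].replace('', np.nan)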
I also know how to renumber the Index column once the rows with missing dates are inserted, using the following code:
# cumcount() numbers rows from 0 in their current order, so sort by date first and add 1
df["Index"] = df.groupby("Serial_no").cumcount() + 1
However, I'm unsure how to locate the missing dates in each group and insert a row for each of those (monthly reported) dates. Any help is appreciated.