I have a dataframe which looks like this:

from datetime import date
import pandas as pd

df = pd.DataFrame({'A': ['C1', 'C1', 'C1', 'C1', 'C2', 'C2', 'C3', 'C3'],
                   'date': [date(2019, 12, 31), date(2018, 12, 31), date(2017, 12, 31), date(2016, 12, 31), date(2017, 12, 31), date(2016, 12, 31), date(2018, 12, 31), date(2016, 12, 31)],
                   'value': [9, 9, 8, 4, 8, 3, 6, 4]})

    A        date  value
0  C1  2019-12-31      9
1  C1  2018-12-31      9
2  C1  2017-12-31      8
3  C1  2016-12-31      4
4  C2  2017-12-31      8
5  C2  2016-12-31      3
6  C3  2018-12-31      6
7  C3  2016-12-31      4

first_year = date(2016, 12, 31)
last_year = date(2019, 12, 31)

For each group in column 'A', I need to add the missing years and fill 'value' with the value of the previous available year. I would like to set the first and last year via input variables. My resulting dataframe should look like this:

     A        date  value
 0  C1  2019-12-31      9
 1  C1  2018-12-31      9
 2  C1  2017-12-31      8
 3  C1  2016-12-31      4
 4  C2  2019-12-31      8
 5  C2  2018-12-31      8
 6  C2  2017-12-31      8
 7  C2  2016-12-31      3
 8  C3  2019-12-31      6
 9  C3  2018-12-31      6
10  C3  2017-12-31      4
11  C3  2016-12-31      4

The following logic applies (by group in column 'A'):

C1 = all years between 2016 and 2019 are already available

C2 = years 2018 and 2019 are missing; they need to be added and take the value of the last available year, 2017 (value = 8)

C3 = year 2017 is missing and takes the value from 2016; year 2019 is missing and takes the value from 2018
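
The per-group logic above can be checked with a small sketch that lists which years are missing for each group. This is only an illustration of the requirement, not a solution; the names `wanted` and `missing` are made up for this example.

```python
from datetime import date

import pandas as pd

df = pd.DataFrame({
    'A': ['C1', 'C1', 'C1', 'C1', 'C2', 'C2', 'C3', 'C3'],
    'date': [date(2019, 12, 31), date(2018, 12, 31), date(2017, 12, 31),
             date(2016, 12, 31), date(2017, 12, 31), date(2016, 12, 31),
             date(2018, 12, 31), date(2016, 12, 31)],
    'value': [9, 9, 8, 4, 8, 3, 6, 4],
})
first_year = date(2016, 12, 31)
last_year = date(2019, 12, 31)

# All years that every group should end up with (hypothetical helper name)
wanted = set(range(first_year.year, last_year.year + 1))

# For each group, subtract the years that are present from the wanted years
missing = {
    key: sorted(wanted - {d.year for d in grp['date']})
    for key, grp in df.groupby('A')
}
print(missing)  # {'C1': [], 'C2': [2018, 2019], 'C3': [2017, 2019]}
```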

2 Answers

IIUC, you can do it like this:

# freq='A' is the year-end alias; newer pandas releases (2.2+) spell it 'YE'
idx = pd.MultiIndex.from_product([df['A'].unique(), 
                                  pd.date_range(first_year, 
                                                last_year, 
                                                freq='A')], 
                                 names=['A','date'])

df.set_index(['A','date'])\
  .reindex(idx)\
  .groupby(level=0)\
  .ffill()\
  .sort_index(level=[0,1], ascending=[True, False])\
  .reset_index()

Output:

     A       date  value
0   C1 2019-12-31    9.0
1   C1 2018-12-31    9.0
2   C1 2017-12-31    8.0
3   C1 2016-12-31    4.0
4   C2 2019-12-31    8.0
5   C2 2018-12-31    8.0
6   C2 2017-12-31    8.0
7   C2 2016-12-31    3.0
8   C3 2019-12-31    6.0
9   C3 2018-12-31    6.0
10  C3 2017-12-31    4.0
11  C3 2016-12-31    4.0

Create the product of your 'A' values and the date range using pd.MultiIndex.from_product. Set ['A', 'date'] as the index of your dataframe and reindex it with that product index, which adds a NaN row for every missing year. Lastly, forward fill within each group using ffill, re-sort the dataframe, and call reset_index.
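
For reference, here is a self-contained sketch of the same steps, one per statement. It makes a few assumptions that are not part of the answer above: the 'date' column is converted with pd.to_datetime so that it matches the generated dates, the year-end dates are built explicitly instead of via a frequency alias (the alias name differs between pandas versions), and 'value' is cast back to int at the end, which only works if every group has a value for the first year.

```python
from datetime import date

import pandas as pd

df = pd.DataFrame({
    'A': ['C1', 'C1', 'C1', 'C1', 'C2', 'C2', 'C3', 'C3'],
    'date': [date(2019, 12, 31), date(2018, 12, 31), date(2017, 12, 31),
             date(2016, 12, 31), date(2017, 12, 31), date(2016, 12, 31),
             date(2018, 12, 31), date(2016, 12, 31)],
    'value': [9, 9, 8, 4, 8, 3, 6, 4],
})
first_year = date(2016, 12, 31)
last_year = date(2019, 12, 31)

# Assumption: convert datetime.date objects to Timestamps before indexing
df['date'] = pd.to_datetime(df['date'])

# Year-end dates from first_year to last_year, in ascending order
years = pd.to_datetime(
    [date(y, 12, 31) for y in range(first_year.year, last_year.year + 1)]
)

# 1. Every (group, year) combination
idx = pd.MultiIndex.from_product([df['A'].unique(), years], names=['A', 'date'])

# 2. Reindex: missing combinations appear as rows with NaN in 'value'
full = df.set_index(['A', 'date']).reindex(idx)

# 3. Forward fill inside each group (dates are ascending at this point)
filled = full.groupby(level='A').ffill()

# 4. Sort newest-first within each group, as in the desired output
result = (filled.sort_index(level=['A', 'date'], ascending=[True, False])
                .reset_index())

# 5. Optional: reindexing turned 'value' into float; cast back to int.
#    This raises if a group has no value at first_year (NaN would remain).
result['value'] = result['value'].astype(int)
print(result)
```

If a group might lack a value for the first year, leave out the final cast and decide separately how those leading NaN rows should be handled.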

Another possible idea using groupby + groupby.apply along with reindex + ffill:

# freq='Y' is the year-end alias; newer pandas releases (2.2+) spell it 'YE'
i = pd.date_range(first_year, last_year, freq='Y', name='date')
df = df.set_index('date').groupby('A',group_keys=False)\
       .apply(lambda s: s.reindex(i).ffill()).reset_index()

Result:

         date   A  value
0  2016-12-31  C1    4.0
1  2017-12-31  C1    8.0
2  2018-12-31  C1    9.0
3  2019-12-31  C1    9.0
4  2016-12-31  C2    3.0
5  2017-12-31  C2    8.0
6  2018-12-31  C2    8.0
7  2019-12-31  C2    8.0
8  2016-12-31  C3    4.0
9  2017-12-31  C3    4.0
10 2018-12-31  C3    6.0
11 2019-12-31  C3    6.0
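
Note that this result is sorted by ascending date and puts 'date' first, unlike the layout requested in the question. Below is a hedged sketch of the same per-group reindex + ffill idea written as an explicit loop, which avoids depending on how groupby.apply treats the grouping column (that behaviour may differ between pandas versions). The pd.to_datetime conversion, the explicit year list, and the names `years` and `pieces` are assumptions made for this example.

```python
from datetime import date

import pandas as pd

df = pd.DataFrame({
    'A': ['C1', 'C1', 'C1', 'C1', 'C2', 'C2', 'C3', 'C3'],
    'date': [date(2019, 12, 31), date(2018, 12, 31), date(2017, 12, 31),
             date(2016, 12, 31), date(2017, 12, 31), date(2016, 12, 31),
             date(2018, 12, 31), date(2016, 12, 31)],
    'value': [9, 9, 8, 4, 8, 3, 6, 4],
})
first_year = date(2016, 12, 31)
last_year = date(2019, 12, 31)

# Assumption: work with Timestamps rather than datetime.date objects
df['date'] = pd.to_datetime(df['date'])
years = pd.to_datetime(
    [date(y, 12, 31) for y in range(first_year.year, last_year.year + 1)]
)

pieces = []
for key, grp in df.groupby('A'):
    # Reindex this group's values onto the full year range, then forward fill
    s = grp.set_index('date')['value'].reindex(years).ffill()
    pieces.append(s.to_frame().assign(A=key))

result = (pd.concat(pieces)
            .rename_axis('date')
            .reset_index()[['A', 'date', 'value']]
            # Newest year first within each group, as the question asks
            .sort_values(['A', 'date'], ascending=[True, False],
                         ignore_index=True))
print(result)
```

As in the results above, 'value' comes back as float because the reindex step introduces NaN before the fill.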