I have 3 dataframes as shown below:
df1:
a b c d e f 2020 2021
a1 b1 c1 d1 e1 f1 334.385 340.210
a1 b1 c1 d1 e1 f1 335.385 341.210
a2 b2 c2 d2 e2 f2 344.385 350.210
a4 b2 c4 d4 e4 f4 354.385 360.210
df2:
a g h i j k 2020 2021
a1 b1 c1 d1 e1 f1 434.385 440.210
a5 b6 c6 d6 e6 f6 444.385 450.210
a5 b6 c6 d6 e6 f6 445.385 451.210
a4 b2 c4 d4 e4 f4 454.385 460.210
a4 b2 c4 d4 e4 f4 455.385 461.210
df3:
a l m n o p 2020 2021
a1 b1 c1 d1 e1 f1 534.385 540.210
a7 b7 c7 d7 e7 f7 544.385 550.210
a4 b2 c4 d4 e4 f4 554.385 560.210
expected output:
a l m n o p 2020 2021 new_2021
a1 b1 c1 d1 e1 f1 534.385 540.210 540.210*(340.210/440.210)
a7 b7 c7 d7 e7 f7 544.385 550.210 numpy.nan
a4 b2 c4 d4 e4 f4 554.385 560.210 560.210*((460.210+461.210)/560.210)
explanation:
I want to match all the first 5 string columns of the 3 dataframes and create a new column with few calculations on the year columns. df3 is my reference dataframe and would like to adjust values in the year columns of df3 with the rate of change in df1 and df2.
ex: for the rows where all the 5 columns match, then i want to do df3['new_2021'] = df3['2021'] * (df1['2021'] / df2['2021']).
If there are multiple rows with same values in the first 5 columns. I want to take sum of year column for the calculations as shown in 3rd row of expected output;
and as shown for 2nd row of expected output, if no match is found for all the 5 columns of df3 in either or both of df1 and df2, i want that row to be left null.
How do I do this efficiently? I have very large dataframes.