match 5 columns of 3 dataframes and create a calculated column

Question

I have 3 dataframes as shown below:

df1:

a  b  c  d  e  f  2020    2021
a1 b1 c1 d1 e1 f1 334.385 340.210
a1 b1 c1 d1 e1 f1 335.385 341.210
a2 b2 c2 d2 e2 f2 344.385 350.210
a4 b2 c4 d4 e4 f4 354.385 360.210

df2:

a  g  h  i  j  k  2020    2021
a1 b1 c1 d1 e1 f1 434.385 440.210
a5 b6 c6 d6 e6 f6 444.385 450.210
a5 b6 c6 d6 e6 f6 445.385 451.210
a4 b2 c4 d4 e4 f4 454.385 460.210
a4 b2 c4 d4 e4 f4 455.385 461.210

df3:

a  l  m  n  o  p  2020    2021
a1 b1 c1 d1 e1 f1 534.385 540.210
a7 b7 c7 d7 e7 f7 544.385 550.210
a4 b2 c4 d4 e4 f4 554.385 560.210

expected output:

a  l  m  n  o  p  2020    2021     new_2021
a1 b1 c1 d1 e1 f1 534.385 540.210  540.210*((340.210+341.210)/440.210)
a7 b7 c7 d7 e7 f7 544.385 550.210  numpy.nan
a4 b2 c4 d4 e4 f4 554.385 560.210  560.210*(360.210/(460.210+461.210))

explanation:
I want to match the first 5 string columns across the 3 dataframes and create a new column from a few calculations on the year columns. df3 is my reference dataframe, and I would like to adjust the values in its year columns by the rate of change between df1 and df2.
For example, for the rows where all 5 columns match, I want to compute df3['new_2021'] = df3['2021'] * (df1['2021'] / df2['2021']).
If there are multiple rows with the same values in the first 5 columns, I want to use the sum of the year column in the calculation, as shown in the 3rd row of the expected output.
As shown in the 2nd row of the expected output, if a row of df3 has no match on all 5 columns in df1, df2, or both, I want new_2021 to be left null for that row.

How do I do this efficiently? I have very large dataframes.

jezrael · Accepted Answer · 2021-04-01T05:18:03

You can aggregate with sum, because the first 5 columns may contain duplicated combinations. Then rename the index levels of df1 and df2 to the column names from df3, so the index is the same in all DataFrames and it is possible to divide and multiply with automatic alignment:

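#numeric_only=True sums only the year columns, the 6th string column is dropped
df1 = df1.groupby(df1.columns[:5].tolist()).sum(numeric_only=True).rename_axis(df3.columns[:5].tolist())
df2 = df2.groupby(df2.columns[:5].tolist()).sum(numeric_only=True).rename_axis(df3.columns[:5].tolist())
df3 = df3.groupby(df3.columns[:5].tolist()).sum(numeric_only=True)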

df3['new_2021'] = df3['2021'] * (df1['2021'] / df2['2021'])
print (df3)
                   2020    2021    new_2021
a  l  m  n  o                              
a1 b1 c1 d1 e1  534.385  540.21  836.214303
a4 b2 c4 d4 e4  554.385  560.21  219.002457
a7 b7 c7 d7 e7  544.385  550.21         NaN
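
To check the numbers above, here is a minimal self-contained sketch that rebuilds the sample data from the question (only the key columns and 2021, to keep it short) and repeats the same sum-then-align idea on Series. The helper name summed is made up for illustration, and the year column is selected explicitly instead of relying on how sum() treats the string column.

```python
import pandas as pd

# Key names taken from df3; df1 and df2 are matched by position (first 5 columns).
keys = ["a", "l", "m", "n", "o"]

df1 = pd.DataFrame(
    [["a1", "b1", "c1", "d1", "e1", 340.210],
     ["a1", "b1", "c1", "d1", "e1", 341.210],
     ["a2", "b2", "c2", "d2", "e2", 350.210],
     ["a4", "b2", "c4", "d4", "e4", 360.210]],
    columns=["a", "b", "c", "d", "e", "2021"])
df2 = pd.DataFrame(
    [["a1", "b1", "c1", "d1", "e1", 440.210],
     ["a5", "b6", "c6", "d6", "e6", 450.210],
     ["a5", "b6", "c6", "d6", "e6", 451.210],
     ["a4", "b2", "c4", "d4", "e4", 460.210],
     ["a4", "b2", "c4", "d4", "e4", 461.210]],
    columns=["a", "g", "h", "i", "j", "2021"])
df3 = pd.DataFrame(
    [["a1", "b1", "c1", "d1", "e1", 540.210],
     ["a7", "b7", "c7", "d7", "e7", 550.210],
     ["a4", "b2", "c4", "d4", "e4", 560.210]],
    columns=keys + ["2021"])


def summed(df):
    # Hypothetical helper: sum 2021 per combination of the first 5 columns,
    # then rename the index levels to df3's names so the Series align.
    return df.groupby(df.columns[:5].tolist())["2021"].sum().rename_axis(keys)


s1, s2, s3 = summed(df1), summed(df2), summed(df3)

# Arithmetic aligns on the MultiIndex; keys missing in df1 or df2 become NaN.
new_2021 = (s3 * (s1 / s2)).reindex(s3.index)
print(new_2021)
```

For the a1 row this works out to 540.21 * (340.21 + 341.21) / 440.21, and for the a4 row to 560.21 * 360.21 / (460.21 + 461.21), so duplicated keys in df1 and df2 are summed before the ratio is taken.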

EDIT: With a duplicated MultiIndex in df3 it is still possible, but it needs more steps:

print (df3)
    a   l   m   n   o   p     2020    2021
0  a1  b1  c1  d1  e1  f1  534.385  540.21
1  a7  b7  c7  d7  e7  f7  544.385  550.21
2  a4  b2  c4  d4  e4  f4  554.385  560.21
3  a1  b1  c1  d1  e1  f1  534.385  200.00
4  a7  b7  c7  d7  e7  f7  544.385  800.00
5  a4  b2  c4  d4  e4  f4  554.385  500.00

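#numeric_only=True sums only the year columns, the 6th string column is dropped
df1 = df1.groupby(df1.columns[:5].tolist()).sum(numeric_only=True).rename_axis(df3.columns[:5].tolist())
df2 = df2.groupby(df2.columns[:5].tolist()).sum(numeric_only=True).rename_axis(df3.columns[:5].tolist())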

#convert first 5 columns to index and sorting
df3 = df3.set_index(df3.columns[:5].tolist()).sort_index()

#create unique MultiIndex from df3 and change index in df1, df2
mux = pd.MultiIndex.from_frame(df3.index.to_frame().drop_duplicates())
df1 = df1.reindex(mux)
df2 = df2.reindex(mux)
print (df2)
                   2020    2021
a  l  m  n  o                  
a1 b1 c1 d1 e1  434.385  440.21
a4 b2 c4 d4 e4  909.770  921.42
a7 b7 c7 d7 e7      NaN     NaN

df3['new_2021'] = df3['2021'] * (df1['2021'] / df2['2021'])
print (df3)
                 p     2020    2021    new_2021
a  l  m  n  o                                  
a1 b1 c1 d1 e1  f1  534.385  540.21  836.214303
            e1  f1  534.385  200.00  309.588605
a4 b2 c4 d4 e4  f4  554.385  560.21  219.002457
            e4  f4  554.385  500.00  195.464609
a7 b7 c7 d7 e7  f7  544.385  550.21         NaN
            e7  f7  544.385  800.00         NaN
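
If you would rather keep df3 as a flat frame in its original row order, a left merge on a precomputed rate table is another option. This is not the approach from the answer above, just a sketch under the same assumptions (keys are the first 5 columns by position, duplicates in df1 and df2 are summed); the names rate_table, rate and out are made up for illustration.

```python
import pandas as pd

keys = ["a", "l", "m", "n", "o"]

df1 = pd.DataFrame(
    [["a1", "b1", "c1", "d1", "e1", "f1", 340.210],
     ["a1", "b1", "c1", "d1", "e1", "f1", 341.210],
     ["a2", "b2", "c2", "d2", "e2", "f2", 350.210],
     ["a4", "b2", "c4", "d4", "e4", "f4", 360.210]],
    columns=["a", "b", "c", "d", "e", "f", "2021"])
df2 = pd.DataFrame(
    [["a1", "b1", "c1", "d1", "e1", "f1", 440.210],
     ["a5", "b6", "c6", "d6", "e6", "f6", 450.210],
     ["a5", "b6", "c6", "d6", "e6", "f6", 451.210],
     ["a4", "b2", "c4", "d4", "e4", "f4", 460.210],
     ["a4", "b2", "c4", "d4", "e4", "f4", 461.210]],
    columns=["a", "g", "h", "i", "j", "k", "2021"])
# df3 with duplicated key combinations, as in the EDIT above.
df3 = pd.DataFrame(
    [["a1", "b1", "c1", "d1", "e1", "f1", 540.21],
     ["a7", "b7", "c7", "d7", "e7", "f7", 550.21],
     ["a4", "b2", "c4", "d4", "e4", "f4", 560.21],
     ["a1", "b1", "c1", "d1", "e1", "f1", 200.00],
     ["a7", "b7", "c7", "d7", "e7", "f7", 800.00],
     ["a4", "b2", "c4", "d4", "e4", "f4", 500.00]],
    columns=keys + ["p", "2021"])


def rate_table(left, right, year):
    # Hypothetical helper: per-key sum of `year` in each frame, then the ratio.
    # Keys present in only one frame end up with a NaN rate.
    s1 = left.groupby(left.columns[:5].tolist())[year].sum().rename_axis(keys)
    s2 = right.groupby(right.columns[:5].tolist())[year].sum().rename_axis(keys)
    return (s1 / s2).rename("rate").reset_index()


rate = rate_table(df1, df2, "2021")

# The rate table has one row per key, so the left merge keeps every df3 row
# exactly once; df3 keys without a rate get NaN.
out = df3.merge(rate, on=keys, how="left")
out["new_2021"] = out["2021"] * out["rate"]
print(out.drop(columns="rate"))
```

Here the p column and the row order of df3 are kept, and no sorting or reindexing of df1 and df2 is needed. Whether this is faster than the index-based version on very large frames is something to measure on your own data.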
