0
votes

This is an extension of my previous question enter link description here

I have two dataframes df1 and df2 of different lengths and two columns as key columns. I would like to perform multiple operations on these dataframes as follows:

  1. Replacing only the blanks (NAs) cells in df1 with corresponding values from df2 based on the key columns
  2. for each key columns pair, the cells from both the dataframes where values are contradicting should be reported in a new dataframe

df1

id_col1   id_col2   name    age    sex
---------------------------------------
101         1M              21  
101         3M              21      M
102         1M      Mark    25  

df2

id_col1    id_col2    name     age     sex
-------------------------------------------
101          1M       Steve             M
101          2M                         M
101          3M       Steve    25   
102          1M       Ria      25       M
102          2M       Anie     22       F

After performing operation 1, i.e. replacing NA's in df1 with the corresponding values from df2, I should get the following:

result_1

id_col1    id_col2    name     age     sex
-------------------------------------------
101         1M        Steve    21      M
101         3M        Steve    25      M
102         1M        Mark     25      M

After performing operation 2, i.e. conflicting cells in df1 and df2 for the same key columns, I should get the following:

result_2

id_col1    id_col2    name     age     sex
-------------------------------------------
101          3M                21   
101          3M                25   
102          1M        Mark     
102          1M        Ria      

Can anyone help in solving these?

2

2 Answers

1
votes

Using df1

enter image description here

df2

enter image description here

Merge

df3=df2.merge(df1, left_index=True,right_index=True,suffixes=('_left', ''), how='left')

Solution 1, use np.where to transfer details and drop rows not required

df3['name']=np.where(df3['name'].isna(),df3['name_left'],df3['name'])
df3['sex']=np.where(df3['sex_left'].isna(),df3['sex'],df3['sex_left'])
df4=df3[df3.index.isin(df1.index)].iloc[:,-3::]

Outcome

enter image description here

You are not so clear on the conflicts and so I assummed conflicts on Name and Age. Fr that reason, I obviusly drop NaNs in age because they exist where I didnt fill them

df3=df3.dropna(subset=['age','age_left'])

Derive dataframe on series of boolean selection

df3[(df3['name_left']!=df3['name'])& df3['age_left']!=df3['age']].dropna(thresh=1, inplace=True)
1
votes

The approach I would take is very similar to the answer to the original problem.

Set the id columns as index, use combine_first, as in the original post. However, since combine_first returns the union of both the dataframes' labels (i.e. rows and columns), after its application, select only those indexes belonging to df1

idx = ['id_col1', 'id_col2']
df1 = df1.set_index(idx)
df2 = df2.set_index(idx)
result_1 = df1.combine_first(df2).loc[df1.index]
# result_1 outputs:
                  name   age sex
id_col1 id_col2
101     1M       Steve  21.0   M
        3M       Steve  21.0   M
102     1M        Mark  25.0   M

To generate result_2, first follow the approach as suggested in the earlier post:

mask = pd.notnull(df1) & ~df1.eq(df2) & pd.notnull(df2)
result_2 = pd.concat([df1[mask], df2[mask]]).dropna(how='all')

This generates the desired data, however the order is slightly off from what you present because df1[mask] is stacked on top of df2[mask].

sort by index to get the final result:

result_2.sort_index()
# outputs 
                 name   age  sex
id_col1 id_col2
101     3M        NaN  21.0  NaN
        3M        NaN  25.0  NaN
102     1M       Mark   NaN  NaN
        1M        Ria   NaN  NaN

The only difference in this solution when compared to the solution presented in the first post is an extra ...loc[df1.index] and result_2.sort_index()