I want to find the differences in the column values of two dataframes when joined using inner join.
df1 has 10 columns, ie. key1,key2 & col1, col2 so on. (columns can be more and name can be different) similarly df2 has 10 columns, i,e, key1,key2 & col1, col2 so on.
df3 = df1.join(df2, 'df1.key1 == df2.key1 and df1.key2 == df2.key2', 'inner')
now I want to compare corresponding columns of two dataframe df1 and df2 that is already there in the joined df3.
Now I am looping it for each x,y in zip(df1.columns,df2.columns) and storing in a list
unmatchList.append((df3.select(df1.x,df2.y).filter(df1.x <> df2.y)))
can I avoid this loop as this is extensively using memory here. There are other calculation that I am doing but this is small code snippet that I have presented. Idea behind this is to find out different values in the corresponding columns for matching row of two dataframe. exceptAll is don't work for this requirement as it find the difference based on position of columns. I need to find the difference only when keys of both dataframes matches.
df1
key1 key2 col1 col2 col3 col4 col5
k11 k21 1 1 1 1 1
k12 k22 2 2 2 2 2
df2
key1 key2 col1 col2 col3 col4 col5
k11 k21 1 1 2 1 1
k12 k22 2 3 2 3 4
Final output i want is
key1 key2 col val1 val2
k11 k21 col3 1 2
k12 k22 col2 2 3
k12 k22 col4 2 3
k12 k22 col5 2 4
val1 is to obtain from df1 and val2 is to obtain from df2