Find difference of values on two dataframe for each corresponding columns using pyspark

Question

I want to find the differences in the column values of two dataframes when joined using inner join.

df1 has 10 columns, ie. key1,key2 & col1, col2 so on. (columns can be more and name can be different) similarly df2 has 10 columns, i,e, key1,key2 & col1, col2 so on.

df3 = df1.join(df2, 'df1.key1 == df2.key1 and df1.key2 == df2.key2', 'inner')

now I want to compare corresponding columns of two dataframe df1 and df2 that is already there in the joined df3.

Now I am looping it for each x,y in zip(df1.columns,df2.columns) and storing in a list unmatchList.append((df3.select(df1.x,df2.y).filter(df1.x <> df2.y)))

can I avoid this loop as this is extensively using memory here. There are other calculation that I am doing but this is small code snippet that I have presented. Idea behind this is to find out different values in the corresponding columns for matching row of two dataframe. exceptAll is don't work for this requirement as it find the difference based on position of columns. I need to find the difference only when keys of both dataframes matches.

df1

key1 key2 col1 col2 col3 col4 col5

k11  k21   1    1    1    1    1

k12  k22   2    2    2    2    2

df2

key1 key2 col1 col2 col3 col4 col5

k11  k21   1    1    2    1    1

k12  k22   2    3    2    3    4

Final output i want is

key1 key2 col  val1 val2

k11  k21  col3 1    2

k12  k22  col2 2    3

k12  k22  col4 2    3

k12  k22  col5 2    4

val1 is to obtain from df1 and val2 is to obtain from df2

Dataplay007 Dataplay007 · Accepted Answer · 2020-06-04T17:52:36

The problem here is, if no of columns in a DataFrame is high, performance of the loop, degrades. It further results in output memory results.

Instead of storing the results in a list we can use a dataframe and store (append or insert into) the results of each iteration in some hdfs location or hive table.

for x,y in zip(df1.columns,df2.columns)
    outputDF=joinedDF.filter(col(x) <> col(y))
                 .withColumns('key1',lit(key1))
                 .withColumns('key2',lit(key2))
                 .withColumns('col',lit(x))
                 .withColumns('val1',col(x))
                 .withColumns('val2',col(y))

    outputDF.partitionBy(x).coalesce(1).write.mode('append').format('hive').saveAsTable('DB.Table')````

#Another approach can be if no of columns are less (10-15):#
    outputDF=outputDF.union(joinedDF.filter(col(x) <> col(y))
                 .withColumns('key1',lit(key1))
                 .withColumns('key2',lit(key2))
                 .withColumns('col',lit(x))
                 .withColumns('val1',col(x))
                 .withColumns('val2',col(y)))

outputDF.partitionBy(x).coalesce(1).write.mode('append').format('hive').saveAsTable('DB.Table')

Find difference of values on two dataframe for each corresponding columns using pyspark

1 Answers