0 votes

I have two data frames. df2:

+----------+-------------------+-------------------+-------------+
|Event_Type|              start|                end|agg_sum_10_15|
+----------+-------------------+-------------------+-------------+
|    event1|2016-04-25 05:30:00|2016-05-02 05:30:00|          1.0|
|    event1|2016-05-09 05:30:00|2016-05-16 05:30:00|          3.0|
|    event1|2016-06-06 05:30:00|2016-06-13 05:30:00|          3.0|
|    event2|2016-05-09 05:30:00|2016-05-16 05:30:00|          1.0|
|    event2|2016-06-06 05:30:00|2016-06-13 05:30:00|          1.0|
|    event3|2016-05-16 05:30:00|2016-05-23 05:30:00|          1.0|
|    event3|2016-06-13 05:30:00|2016-06-20 05:30:00|          1.0|
+----------+-------------------+-------------------+-------------+

and df3:

+----------+-------------------+-------------------+--------------+
|Event_Type|              start|                end|agg_sum_15_110|
+----------+-------------------+-------------------+--------------+
|    event1|2016-04-25 05:30:00|2016-05-02 05:30:00|           1.0|
|    event1|2016-05-30 05:30:00|2016-06-06 05:30:00|           1.0|
|    event2|2016-05-02 05:30:00|2016-05-09 05:30:00|           2.0|
|    event2|2016-05-16 05:30:00|2016-05-23 05:30:00|           2.0|
|    event3|2016-05-02 05:30:00|2016-05-09 05:30:00|          11.0|
|    event3|2016-05-23 05:30:00|2016-05-30 05:30:00|           1.0|
+----------+-------------------+-------------------+--------------+

There could be several such data frames. The keys/columns on which the match is to be done are 'Event_Type' and 'start'. When I join them (outer join), the key columns get repeated. Is there a way to keep each column only once, filling in null values wherever there is no match?

The purpose of the outer join: whenever there is a match (based on the keys) there should be a single row, and where there is no match additional rows are added, with null values for the missing fields.

I am using the following code for the join:

    dftotal = df2.join(df3, ((df2.Event_Type == df3.Event_Type) & (df2.start == df3.start)), 'outer')

The above code gives the following output:

+----------+-------------------+-------------------+-------------+----------+-------------------+-------------------+--------------+
|Event_Type|              start|                end|agg_sum_10_15|Event_Type|              start|                end|agg_sum_15_110|
+----------+-------------------+-------------------+-------------+----------+-------------------+-------------------+--------------+
|      null|               null|               null|         null|    event3|2016-05-23 05:30:00|2016-05-30 05:30:00|           1.0|
|    event2|2016-05-09 05:30:00|2016-05-16 05:30:00|          1.0|      null|               null|               null|          null|
|    event1|2016-05-09 05:30:00|2016-05-16 05:30:00|          3.0|      null|               null|               null|          null|
|    event3|2016-05-16 05:30:00|2016-05-23 05:30:00|          1.0|      null|               null|               null|          null|
|      null|               null|               null|         null|    event1|2016-05-30 05:30:00|2016-06-06 05:30:00|           1.0|
|      null|               null|               null|         null|    event2|2016-05-02 05:30:00|2016-05-09 05:30:00|           2.0|
|      null|               null|               null|         null|    event3|2016-05-02 05:30:00|2016-05-09 05:30:00|          11.0|
|    event2|2016-06-06 05:30:00|2016-06-13 05:30:00|          1.0|      null|               null|               null|          null|
|    event3|2016-06-13 05:30:00|2016-06-20 05:30:00|          1.0|      null|               null|               null|          null|
|      null|               null|               null|         null|    event2|2016-05-16 05:30:00|2016-05-23 05:30:00|           2.0|
|    event1|2016-06-06 05:30:00|2016-06-13 05:30:00|          3.0|      null|               null|               null|          null|
|    event1|2016-04-25 05:30:00|2016-05-02 05:30:00|          1.0|    event1|2016-04-25 05:30:00|2016-05-02 05:30:00|           1.0|
+----------+-------------------+-------------------+-------------+----------+-------------------+-------------------+--------------+

I want a single 'Event_Type' column: the null values in the first 'Event_Type' column should be filled from the second 'Event_Type' column, and similarly for the 'start' field. I hope that explains the required output. I read somewhere that the 'coalesce' function may be helpful.

1
Could You please add how output should look like? - Prathik Kini
@PrathikKini I have added a description of the output which I am looking for, hope it is helpful - Devarshi Mandal

1 Answer

2 votes

You are right: coalesce is the function you are looking for.

    >>> from pyspark.sql.functions import *
    >>> dftotal = df2.join(df3, ((df2.Event_Type == df3.Event_Type) & (df2.start == df3.start)), 'outer') \
    ...               .select(coalesce(df2.Event_Type, df3.Event_Type),
    ...                       coalesce(df2.start, df3.start),
    ...                       df2.end, df2.agg_sum_10_15,
    ...                       df3.end, df3.agg_sum_15_110)
    >>> dftotal.show()
    +--------------------------------+----------------------+-------------------+-------------+-------------------+--------------+
    |coalesce(Event_Type, Event_Type)|coalesce(start, start)|                end|agg_sum_10_15|                end|agg_sum_15_110|
    +--------------------------------+----------------------+-------------------+-------------+-------------------+--------------+
    |                          event1|   2016-05-09 05:30:00|2016-05-16 05:30:00|          3.0|               null|          null|
    |                          event1|   2016-06-06 05:30:00|2016-06-13 05:30:00|          3.0|               null|          null|
    |                          event2|   2016-05-02 05:30:00|               null|         null|2016-05-09 05:30:00|           2.0|
    |                          event3|   2016-05-02 05:30:00|               null|         null|2016-05-09 05:30:00|          11.0|
    |                          event2|   2016-05-16 05:30:00|               null|         null|2016-05-23 05:30:00|           2.0|
    |                          event1|   2016-05-30 05:30:00|               null|         null|2016-06-06 05:30:00|           1.0|
    |                          event3|   2016-05-16 05:30:00|2016-05-23 05:30:00|          1.0|               null|          null|
    |                          event2|   2016-06-06 05:30:00|2016-06-13 05:30:00|          1.0|               null|          null|
    |                          event1|   2016-04-25 05:30:00|2016-05-02 05:30:00|          1.0|2016-05-02 05:30:00|           1.0|
    |                          event3|   2016-06-13 05:30:00|2016-06-20 05:30:00|          1.0|               null|          null|
    |                          event3|   2016-05-23 05:30:00|               null|         null|2016-05-30 05:30:00|           1.0|
    |                          event2|   2016-05-09 05:30:00|2016-05-16 05:30:00|          1.0|               null|          null|
    +--------------------------------+----------------------+-------------------+-------------+-------------------+--------------+