0 votes

I have two data frames. df2:

+----------+-------------------+-------------------+-------------+
|Event_Type|              start|                end|agg_sum_10_15|
+----------+-------------------+-------------------+-------------+
|    event1|2016-04-25 05:30:00|2016-05-02 05:30:00|          1.0|
|    event1|2016-05-09 05:30:00|2016-05-16 05:30:00|          3.0|
|    event1|2016-06-06 05:30:00|2016-06-13 05:30:00|          3.0|
|    event2|2016-05-09 05:30:00|2016-05-16 05:30:00|          1.0|
|    event2|2016-06-06 05:30:00|2016-06-13 05:30:00|          1.0|
|    event3|2016-05-16 05:30:00|2016-05-23 05:30:00|          1.0|
|    event3|2016-06-13 05:30:00|2016-06-20 05:30:00|          1.0|
+----------+-------------------+-------------------+-------------+

and df3:

+----------+-------------------+-------------------+--------------+
|Event_Type|              start|                end|agg_sum_15_110|
+----------+-------------------+-------------------+--------------+
|    event1|2016-04-25 05:30:00|2016-05-02 05:30:00|           1.0|
|    event1|2016-05-30 05:30:00|2016-06-06 05:30:00|           1.0|
|    event2|2016-05-02 05:30:00|2016-05-09 05:30:00|           2.0|
|    event2|2016-05-16 05:30:00|2016-05-23 05:30:00|           2.0|
|    event3|2016-05-02 05:30:00|2016-05-09 05:30:00|          11.0|
|    event3|2016-05-23 05:30:00|2016-05-30 05:30:00|           1.0|
+----------+-------------------+-------------------+--------------+

There could be several such data frames. The keys/columns on which the match is to be done are 'Event_Type' and 'start'. When I join them (outer join), the key columns get repeated. Is there a way to keep each column only once, filling in null values wherever there is no match?

The purpose of the outer join: whenever there is a match (based on the keys) there should be a single row, and where there is no match additional rows are added, with null values for the missing fields.

I am using the following code for the join:

    dftotal = df2.join(df3, ((df2.Event_Type == df3.Event_Type) & (df2.start == df3.start)), 'outer')

The above code gives the following output:

+----------+-------------------+-------------------+-------------+----------+-------------------+-------------------+--------------+
|Event_Type|              start|                end|agg_sum_10_15|Event_Type|              start|                end|agg_sum_15_110|
+----------+-------------------+-------------------+-------------+----------+-------------------+-------------------+--------------+
|      null|               null|               null|         null|    event3|2016-05-23 05:30:00|2016-05-30 05:30:00|           1.0|
|    event2|2016-05-09 05:30:00|2016-05-16 05:30:00|          1.0|      null|               null|               null|          null|
|    event1|2016-05-09 05:30:00|2016-05-16 05:30:00|          3.0|      null|               null|               null|          null|
|    event3|2016-05-16 05:30:00|2016-05-23 05:30:00|          1.0|      null|               null|               null|          null|
|      null|               null|               null|         null|    event1|2016-05-30 05:30:00|2016-06-06 05:30:00|           1.0|
|      null|               null|               null|         null|    event2|2016-05-02 05:30:00|2016-05-09 05:30:00|           2.0|
|      null|               null|               null|         null|    event3|2016-05-02 05:30:00|2016-05-09 05:30:00|          11.0|
|    event2|2016-06-06 05:30:00|2016-06-13 05:30:00|          1.0|      null|               null|               null|          null|
|    event3|2016-06-13 05:30:00|2016-06-20 05:30:00|          1.0|      null|               null|               null|          null|
|      null|               null|               null|         null|    event2|2016-05-16 05:30:00|2016-05-23 05:30:00|           2.0|
|    event1|2016-06-06 05:30:00|2016-06-13 05:30:00|          3.0|      null|               null|               null|          null|
|    event1|2016-04-25 05:30:00|2016-05-02 05:30:00|          1.0|    event1|2016-04-25 05:30:00|2016-05-02 05:30:00|           1.0|
+----------+-------------------+-------------------+-------------+----------+-------------------+-------------------+--------------+

I want a single 'Event_Type' column: the null values in the first 'Event_Type' column should be filled from the second 'Event_Type' column, and similarly for the 'start' field. I hope that explains the required output. I read somewhere that the 'coalesce' function may be helpful.

1
Could You please add how output should look like? - Prathik Kini
@PrathikKini I have added a description of the output which I am looking for, hope it is helpful - Devarshi Mandal

1 Answer

2 votes

You are right: coalesce is the function you are looking for.

    >>> from pyspark.sql.functions import *
    >>> dftotal = df2.join(df3, ((df2.Event_Type == df3.Event_Type) & (df2.start == df3.start)), 'outer') \
    ...               .select(coalesce(df2.Event_Type, df3.Event_Type),
    ...                       coalesce(df2.start, df3.start),
    ...                       df2.end, df2.agg_sum_10_15,
    ...                       df3.end, df3.agg_sum_15_110)
    >>> dftotal.show()
    +--------------------------------+----------------------+-------------------+-------------+-------------------+--------------+
    |coalesce(Event_Type, Event_Type)|coalesce(start, start)|                end|agg_sum_10_15|                end|agg_sum_15_110|
    +--------------------------------+----------------------+-------------------+-------------+-------------------+--------------+
    |                          event1|   2016-05-09 05:30:00|2016-05-16 05:30:00|          3.0|               null|          null|
    |                          event1|   2016-06-06 05:30:00|2016-06-13 05:30:00|          3.0|               null|          null|
    |                          event2|   2016-05-02 05:30:00|               null|         null|2016-05-09 05:30:00|           2.0|
    |                          event3|   2016-05-02 05:30:00|               null|         null|2016-05-09 05:30:00|          11.0|
    |                          event2|   2016-05-16 05:30:00|               null|         null|2016-05-23 05:30:00|           2.0|
    |                          event1|   2016-05-30 05:30:00|               null|         null|2016-06-06 05:30:00|           1.0|
    |                          event3|   2016-05-16 05:30:00|2016-05-23 05:30:00|          1.0|               null|          null|
    |                          event2|   2016-06-06 05:30:00|2016-06-13 05:30:00|          1.0|               null|          null|
    |                          event1|   2016-04-25 05:30:00|2016-05-02 05:30:00|          1.0|2016-05-02 05:30:00|           1.0|
    |                          event3|   2016-06-13 05:30:00|2016-06-20 05:30:00|          1.0|               null|          null|
    |                          event3|   2016-05-23 05:30:00|               null|         null|2016-05-30 05:30:00|           1.0|
    |                          event2|   2016-05-09 05:30:00|2016-05-16 05:30:00|          1.0|               null|          null|
    +--------------------------------+----------------------+-------------------+-------------+-------------------+--------------+