I have two data frames df2
+----------+-------------------+-------------------+-------------+
|Event_Type| start| end|agg_sum_10_15|
+----------+-------------------+-------------------+-------------+
| event1|2016-04-25 05:30:00|2016-05-02 05:30:00| 1.0|
| event1|2016-05-09 05:30:00|2016-05-16 05:30:00| 3.0|
| event1|2016-06-06 05:30:00|2016-06-13 05:30:00| 3.0|
| event2|2016-05-09 05:30:00|2016-05-16 05:30:00| 1.0|
| event2|2016-06-06 05:30:00|2016-06-13 05:30:00| 1.0|
| event3|2016-05-16 05:30:00|2016-05-23 05:30:00| 1.0|
| event3|2016-06-13 05:30:00|2016-06-20 05:30:00| 1.0|
+----------+-------------------+-------------------+-------------+
and df3
+----------+-------------------+-------------------+--------------+
|Event_Type| start| end|agg_sum_15_110|
+----------+-------------------+-------------------+--------------+
| event1|2016-04-25 05:30:00|2016-05-02 05:30:00| 1.0|
| event1|2016-05-30 05:30:00|2016-06-06 05:30:00| 1.0|
| event2|2016-05-02 05:30:00|2016-05-09 05:30:00| 2.0|
| event2|2016-05-16 05:30:00|2016-05-23 05:30:00| 2.0|
| event3|2016-05-02 05:30:00|2016-05-09 05:30:00| 11.0|
| event3|2016-05-23 05:30:00|2016-05-30 05:30:00| 1.0|
+----------+-------------------+-------------------+--------------+
there could be several data frames, the keys / columns on the basis of which the match is to be done is 'Event_type' and 'start', when I join them (outer join), the columns get repeated. Is there a way to have the column only once, fill null values where ever there is no match
Purpose of outer join, whenever there is match (based on keys) there should be single row, and in case there is no match additional rows get added (null values for the missing values)
using the following code for join
dftotal = df2.join(df3,((df2.Event_Type == df3.Event_Type) & (df2.start == df3.start )), 'outer'). The above code gives the following output
+----------+-------------------+-------------------+-------------+----------+-------------------+-------------------+--------------+
|Event_Type| start| end|agg_sum_10_15|Event_Type| start| end|agg_sum_15_110|
+----------+-------------------+-------------------+-------------+----------+-------------------+-------------------+--------------+
| null| null| null| null| event3|2016-05-23 05:30:00|2016-05-30 05:30:00| 1.0|
| event2|2016-05-09 05:30:00|2016-05-16 05:30:00| 1.0| null| null| null| null|
| event1|2016-05-09 05:30:00|2016-05-16 05:30:00| 3.0| null| null| null| null|
| event3|2016-05-16 05:30:00|2016-05-23 05:30:00| 1.0| null| null| null| null|
| null| null| null| null| event1|2016-05-30 05:30:00|2016-06-06 05:30:00| 1.0|
| null| null| null| null| event2|2016-05-02 05:30:00|2016-05-09 05:30:00| 2.0|
| null| null| null| null| event3|2016-05-02 05:30:00|2016-05-09 05:30:00| 11.0|
| event2|2016-06-06 05:30:00|2016-06-13 05:30:00| 1.0| null| null| null| null|
| event3|2016-06-13 05:30:00|2016-06-20 05:30:00| 1.0| null| null| null| null|
| null| null| null| null| event2|2016-05-16 05:30:00|2016-05-23 05:30:00| 2.0|
| event1|2016-06-06 05:30:00|2016-06-13 05:30:00| 3.0| null| null| null| null|
| event1|2016-04-25 05:30:00|2016-05-02 05:30:00| 1.0| event1|2016-04-25 05:30:00|2016-05-02 05:30:00| 1.0|
+----------+-------------------+-------------------+-------------+----------+-------------------+-------------------+--------------+
I want a single 'Event_type' column. The null values in first 'Event_Type' gets the value from the second 'Event_type' column, similarly for the start field. Hope it explains the required output I read somewhere that 'coalesce' command may be helpful