1
votes

Given three dataframes df1, df2 and df3, how can I join them so that each df3 timestamp falls between the start and end times in df1 or df2?

I need to add the JobID to df3 based on whether the df3 'Timestamp' falls between 'Start Time' and 'End Time' in df1 or df2; the Node (No.) must match as well.

df1(1230rows*3 columns)

Node      Start Time      End Time      JobID
A         00:03:50        00:05:45      12345
A         00:06:10        00:07:39      56789
A         00:08:30        00:10:45      34567
.
.
.

df2(1130rows*3 columns)

Node      Start Time      End Time      JobID
B         00:02:30        00:07:35      13579
B         00:08:56        00:09:39      24680
B         00:10:32        00:13:47      14680
.
.
.

df3(4002rows*3 columns)

Node      Timestamp     
A         00:05:42       
A         00:09:50       
A         00:11:27       
B         00:04:48
B         00:09:59
B         00:10:32
.
.
.
.

Expected Output: df3(4002rows*3 columns)

Node      Timestamp       JobID
A         00:05:42        12345              
A         00:09:50        34567       
A         00:11:27        NaN
B         00:04:48        13579
B         00:09:59        NaN
B         00:10:32        14680
.
.
.
.

2 Answers

1
votes

You can use .merge() and filter with .between(), as follows:

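import pandas as pd

# Pair every df3 row with every df1 job on the same Node, then keep only the
# pairs whose Timestamp lies inside the job's [Start Time, End Time] interval.
df1_3 = df1.merge(df3, on='Node')
df1_3_filtered = df1_3[df1_3['Timestamp'].between(df1_3['Start Time'], df1_3['End Time'])]

df2_3 = df2.merge(df3, on='Node')
df2_3_filtered = df2_3[df2_3['Timestamp'].between(df2_3['Start Time'], df2_3['End Time'])]

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames instead.
df_out = pd.concat([df1_3_filtered, df2_3_filtered])[['Node', 'JobID', 'Timestamp']]
# Left-merge back onto df3 so that unmatched timestamps keep a NaN JobID.
df_out = df3.merge(df_out, how='left')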

Result:

print(df_out)


  Node Timestamp    JobID
0    A  00:05:42  12345.0
1    A  00:09:50  34567.0
2    A  00:11:27      NaN
3    B  00:04:48  13579.0
4    B  00:09:59      NaN
5    B  00:10:32  14680.0
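
If the time columns are stored as strings, .between() compares them as text, which only works while every value is zero-padded HH:MM:SS. Below is a minimal sketch of the same merge-and-filter idea with the columns converted through pd.to_timedelta first. It assumes the sample rows from the question; the names jobs, events, paired and hits are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the question.
df1 = pd.DataFrame({
    "Node": ["A", "A", "A"],
    "Start Time": ["00:03:50", "00:06:10", "00:08:30"],
    "End Time": ["00:05:45", "00:07:39", "00:10:45"],
    "JobID": [12345, 56789, 34567],
})
df2 = pd.DataFrame({
    "Node": ["B", "B", "B"],
    "Start Time": ["00:02:30", "00:08:56", "00:10:32"],
    "End Time": ["00:07:35", "00:09:39", "00:13:47"],
    "JobID": [13579, 24680, 14680],
})
df3 = pd.DataFrame({
    "Node": ["A", "A", "A", "B", "B", "B"],
    "Timestamp": ["00:05:42", "00:09:50", "00:11:27",
                  "00:04:48", "00:09:59", "00:10:32"],
})

# Stack the job tables and turn the time strings into real durations.
jobs = pd.concat([df1, df2], ignore_index=True)
for col in ["Start Time", "End Time"]:
    jobs[col] = pd.to_timedelta(jobs[col])
events = df3.assign(Timestamp=pd.to_timedelta(df3["Timestamp"]))

# Pair each event with every job on the same Node, then keep the pairs
# where the event lies inside the job interval (both ends inclusive).
paired = events.merge(jobs, on="Node")
inside = paired["Timestamp"].between(paired["Start Time"], paired["End Time"])
hits = paired.loc[inside, ["Node", "Timestamp", "JobID"]].drop_duplicates()

# Left-merge back so that events without a job keep a NaN JobID.
df_out = events.merge(hits, on=["Node", "Timestamp"], how="left")
print(df_out)
```

The Timestamp column comes back as timedelta values here; whether that is acceptable depends on how df3 is used afterwards.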

Edit

If you have multiple dataframes with the same structure as df1 and df2 and want to merge them all with df3, put them into the list List_dfs below:

List_dfs = [df1, df2]              # put all your dataframes of same structure here

Then run the code below. df_out will hold the merged and filtered results for all of these dataframes:

filtered_parts = []                # accumulates the filtered result of each dataframe
for df in List_dfs:
    dfx_3 = df.merge(df3, on='Node')
    dfx_3_filtered = dfx_3[dfx_3['Timestamp'].between(dfx_3['Start Time'], dfx_3['End Time'])]
    filtered_parts.append(dfx_3_filtered)   # plain list append, not DataFrame.append

# DataFrame.append was removed in pandas 2.0, so combine the parts with pd.concat.
df_all_filtered = pd.concat(filtered_parts)
df_out = df_all_filtered[['Node', 'JobID', 'Timestamp']]
df_out = df3.merge(df_out, how='left')
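
Note that merging on 'Node' pairs every df3 row with every job on the same node before the filter runs, so the intermediate frame can get large. If the job intervals within a node never overlap (an assumption, the question does not say so), pd.merge_asof may be a lighter alternative: it picks the latest job that started at or before each timestamp, and a second step discards matches that have already ended. This is a sketch of a different technique, not the method above; to_seconds, start_s, end_s, ts_s and order are hypothetical helper names.

```python
import pandas as pd

def to_seconds(col):
    """Convert 'HH:MM:SS' strings to integer seconds (hypothetical helper)."""
    return pd.to_timedelta(col).dt.total_seconds().astype("int64")

# Sample rows copied from the question.
df1 = pd.DataFrame({
    "Node": ["A", "A", "A"],
    "Start Time": ["00:03:50", "00:06:10", "00:08:30"],
    "End Time": ["00:05:45", "00:07:39", "00:10:45"],
    "JobID": [12345, 56789, 34567],
})
df2 = pd.DataFrame({
    "Node": ["B", "B", "B"],
    "Start Time": ["00:02:30", "00:08:56", "00:10:32"],
    "End Time": ["00:07:35", "00:09:39", "00:13:47"],
    "JobID": [13579, 24680, 14680],
})
df3 = pd.DataFrame({
    "Node": ["A", "A", "A", "B", "B", "B"],
    "Timestamp": ["00:05:42", "00:09:50", "00:11:27",
                  "00:04:48", "00:09:59", "00:10:32"],
})

# merge_asof needs sortable numeric keys, sorted on both sides.
jobs = pd.concat([df1, df2], ignore_index=True)
jobs["start_s"] = to_seconds(jobs["Start Time"])
jobs["end_s"] = to_seconds(jobs["End Time"])
jobs = jobs.sort_values("start_s")

events = df3.copy()
events["ts_s"] = to_seconds(events["Timestamp"])
events["order"] = range(len(events))      # remember the original row order
events = events.sort_values("ts_s")

# For each event, find the latest job on the same Node that started at or
# before the event's timestamp.
matched = pd.merge_asof(
    events,
    jobs[["Node", "start_s", "end_s", "JobID"]],
    left_on="ts_s", right_on="start_s",
    by="Node", direction="backward",
)

# Blank out the JobID when that job had already ended before the timestamp.
matched["JobID"] = matched["JobID"].where(matched["ts_s"] <= matched["end_s"])

df_out = (matched.sort_values("order")[["Node", "Timestamp", "JobID"]]
                 .reset_index(drop=True))
print(df_out)
```

With overlapping intervals this would only ever report the most recently started job, so the merge-and-filter loop remains the safer choice in that case.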
0
votes

Another method would be to expand each job interval into one row per second, then merge with df3 on the exact timestamp.

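import pandas as pd

def resample_shifts(dataframe: pd.DataFrame, indices: list,
                    start_col: str, end_col: str) -> pd.DataFrame:
    # Build one timestamp per second between start and end for every row,
    # then explode so that each second becomes its own row.
    return (dataframe.set_index(indices)
                     .apply(lambda x: pd.date_range(x[start_col], x[end_col], freq='s'),
                            axis=1)
                     .explode()
                     .rename('Timestamp')
                     .reset_index()
                     # explode may leave the column as object dtype; cast it so
                     # the merge key has the same dtype as df3['Timestamp'].
                     .astype({'Timestamp': 'datetime64[ns]'}))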

# Column names follow the question's headers ('Start Time', 'End Time').
df1a = resample_shifts(df1, ['Node', 'JobID'], 'Start Time', 'End Time')

df2a = resample_shifts(df2, ['Node', 'JobID'], 'Start Time', 'End Time')

df3['Timestamp'] = pd.to_datetime(df3['Timestamp'])

df3a = pd.merge(pd.concat([df1a,df2a]),df3,on=['Node','Timestamp'],how='right')

print(df3a)
  Node    JobID           Timestamp
0    A  12345.0 2021-06-28 00:05:42
1    A  34567.0 2021-06-28 00:09:50
2    A      NaN 2021-06-28 00:11:27
3    B  13579.0 2021-06-28 00:04:48
4    B      NaN 2021-06-28 00:09:59
5    B  14680.0 2021-06-28 00:10:32
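
The date in the output above (2021-06-28) is not in the data: parsing a time-only string attaches the date on which the code runs. A hedged sketch of the same per-second expansion with an explicit format, so that every value gets the same placeholder date (1900-01-01) regardless of when it runs. It builds the expanded rows with a plain list comprehension rather than apply/explode, and it assumes the sample rows from the question; FMT, jobs, events, records and expanded are illustrative names.

```python
import pandas as pd

FMT = "%H:%M:%S"   # explicit format: parsed values all land on 1900-01-01

# Sample rows from the question; jobs stands in for df1 and df2 stacked together.
jobs = pd.DataFrame({
    "Node": ["A", "A", "A", "B", "B", "B"],
    "Start Time": ["00:03:50", "00:06:10", "00:08:30",
                   "00:02:30", "00:08:56", "00:10:32"],
    "End Time": ["00:05:45", "00:07:39", "00:10:45",
                 "00:07:35", "00:09:39", "00:13:47"],
    "JobID": [12345, 56789, 34567, 13579, 24680, 14680],
})
df3 = pd.DataFrame({
    "Node": ["A", "A", "A", "B", "B", "B"],
    "Timestamp": ["00:05:42", "00:09:50", "00:11:27",
                  "00:04:48", "00:09:59", "00:10:32"],
})

starts = pd.to_datetime(jobs["Start Time"], format=FMT)
ends = pd.to_datetime(jobs["End Time"], format=FMT)

# One (Node, JobID, second) record for every second covered by a job.
records = [
    (node, job_id, second)
    for node, job_id, start, end in zip(jobs["Node"], jobs["JobID"], starts, ends)
    for second in pd.date_range(start, end, freq="s")
]
expanded = pd.DataFrame(records, columns=["Node", "JobID", "Timestamp"])

# Parse df3 with the same format, then match on the exact second.
events = df3.assign(Timestamp=pd.to_datetime(df3["Timestamp"], format=FMT))
df3a = events.merge(expanded, on=["Node", "Timestamp"], how="left")

# Show only the time of day again.
df3a["Timestamp"] = df3a["Timestamp"].dt.strftime(FMT)
print(df3a)
```

The expanded table holds one row per second of every job, so this approach trades memory for a simple equality merge; it suits short intervals better than long ones.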