Goal:
I want to merge two dataframes df1 and df2 with a tolerance in an efficient way using python. df1 has shape (l, 2) and df2 has shape (p, 13) with l < m < p. My target dataframe df3 with shape (m, 13) is supposed to contain all matches within the tolerance and not only the closest match.
I want to merge Col0 of df1 with Col2 of df2 with a tolerance "tolerance".
Example:
df1:
Index, Col0, Col1
0, 1008.5155, n01
df2:
Index, Col0, Col1, Col2, Col3, Col4, Col5, Col6, ...
0, 0, 0, 510.0103, k03, 0, k05, k06, ...
1, 0, 0, 1007.6176, k13, 0, k15, k16, ...
2, 0, 0, 1008.6248, k123, 0, k25, k26, ...
df3:
Index, Col0, Col1, Col2, Col3, Col4, Col5, Col6, ...
0, 1008.5155, 0.8979, 1007.6176, k03, n01, k05, k06, ...
1, 1008.5155, 0.1093, 1008.6248, k13, n01, k15, k16, ...
To visualize, col1 of df3 gives me the difference of the respective value of df1 and df2. Hence, it has to be smaller than the tolerance.
My current solution takes a lot of time and requires a lot of memory.
# Create empty list to collect matches
df3_list = []
df3_array = np.asarray(df3_list)
# loops to find matches. Fills array with matches
df3_row = np.asarray([0.0, 0.0, 0.0, 0.0, 0.0, 0, 0, 0, 0, 0, 0, 0, 0])
for n in range(len(df1)):
for k in range(len(df2)):
if abs(df1.iloc[n,0]-df2.iloc[k,2]) < tolerance:
df3_row[0] = df1.iloc[n,0]
df3_row[1] = abs(df1.iloc[n,0]-df2.iloc[k,2])
df3_row[2] = df2.iloc[k,2]
df3_row[3] = df2.iloc[k,3]
df3_row[4] = df1.iloc[n,1]
df3_row[5] = df2.iloc[k,5]
.
.
.
df3_array = np.append(df3_array, df3_row)
# convert list into dataframe
df3 = pd.DataFrame(df3_array.T.reshape(-1,13), columns = header)
I have also tried to get both indices in one go with
[[n, k] for n, k in zip(range(len(df1)), range(len(df2))) if abs(df1.iloc[n,0]-df2.iloc[k,2]) < tolerance]
However, it only gives me an empty array, so I am doing it wrong.
For the respective arrays, I have also tried to use
np.nonzero(np.isclose(df2_array[:, 2], df1_array[:,:,None], atol=tolerance))[-1]
However, np.isclose + np.nonzero only got me indices of df2 and also many more than with my loop-intensive approach. Without the corresponding indices of df1, I am kind of lost. I think this last approach is the most promising, yet I seem unable to merge the data set because the values are no exact match and because the closest match is not always the correct solution. Any ideas how to overcome this problem?