2
votes

Goal:

I want to merge two dataframes df1 and df2 with a tolerance in an efficient way using python. df1 has shape (l, 2) and df2 has shape (p, 13) with l < m < p. My target dataframe df3 with shape (m, 13) is supposed to contain all matches within the tolerance and not only the closest match.

I want to merge Col0 of df1 with Col2 of df2 with a tolerance "tolerance".

Example:

df1:

Index, Col0, Col1
0, 1008.5155, n01

df2:

Index, Col0, Col1, Col2, Col3, Col4, Col5, Col6, ...
0, 0, 0, 510.0103, k03, 0, k05, k06, ... 
1, 0, 0, 1007.6176, k13, 0, k15, k16, ...
2, 0, 0, 1008.6248, k123, 0, k25, k26, ...

df3:

Index, Col0, Col1, Col2, Col3, Col4, Col5, Col6, ...
0, 1008.5155, 0.8979, 1007.6176, k03, n01, k05, k06, ...
1, 1008.5155, 0.1093, 1008.6248, k13, n01, k15, k16, ...

To visualize, col1 of df3 gives me the difference of the respective value of df1 and df2. Hence, it has to be smaller than the tolerance.

My current solution takes a lot of time and requires a lot of memory.

 # Create empty list to collect matches
df3_list = []
df3_array = np.asarray(df3_list)

# loops to find matches. Fills array with matches
df3_row = np.asarray([0.0, 0.0, 0.0, 0.0, 0.0, 0, 0, 0, 0, 0, 0, 0, 0])

for n in range(len(df1)):
    for k in range(len(df2)):
        if abs(df1.iloc[n,0]-df2.iloc[k,2]) < tolerance:
            df3_row[0] = df1.iloc[n,0]
            df3_row[1] = abs(df1.iloc[n,0]-df2.iloc[k,2])
            df3_row[2] = df2.iloc[k,2]
            df3_row[3] = df2.iloc[k,3]
            df3_row[4] = df1.iloc[n,1]
            df3_row[5] = df2.iloc[k,5]
                       .
                       .
                       .

            df3_array = np.append(df3_array, df3_row)

# convert list into dataframe
df3 = pd.DataFrame(df3_array.T.reshape(-1,13), columns = header)

I have also tried to get both indices in one go with

[[n, k]  for n, k in zip(range(len(df1)), range(len(df2))) if abs(df1.iloc[n,0]-df2.iloc[k,2]) < tolerance]

However, it only gives me an empty array, so I am doing it wrong.

For the respective arrays, I have also tried to use

np.nonzero(np.isclose(df2_array[:, 2], df1_array[:,:,None], atol=tolerance))[-1]

However, np.isclose + np.nonzero only got me indices of df2 and also many more than with my loop-intensive approach. Without the corresponding indices of df1, I am kind of lost. I think this last approach is the most promising, yet I seem unable to merge the data set because the values are no exact match and because the closest match is not always the correct solution. Any ideas how to overcome this problem?

1

1 Answers

1
votes

You need to divide this problem in parts

  1. Find the corresponding close indices
  2. Join the DataFrames on those indices
  3. do your extra calculations

Find the indices

using np.isclose, this is a very simple generator function which yields a DataFrame containing the index of df1 and df2 which are close for each row of df1

def find_close(df1, df1_col, df2, df2_col, tolerance=1):
    for index, value in df1[df1_col].items():
        indices = df2.index[np.isclose(df2[df2_col].values, value, atol=tolerance)]
        s = pd.DataFrame(data={'idx1': index, 'idx2': indices.values})
        yield s

Then we can easily concatenate these to get use a helper DataFrame containing the different indices.

df_idx = pd.concat(find_close(df1, 'Col0', df2, 'Col2'), ignore_index=True)

To test this I added a 2nd record to df1

df1_str = '''Index, Col0, Col1
0, 1008.5155, n01
1, 510, n03'''
  idx1    idx2
0 0   1
1 0   2
2 1   0

Join the DataFrames

using pd.merge

df1_close = pd.merge(df_idx, df1, left_on='idx1', right_index=True).reindex(columns=df1.columns)
df2_close = pd.merge(df_idx, df2, left_on='idx2', right_index=True).reindex(columns=df2.columns)
df_merged = pd.merge(df1_close, df2_close, left_index=True, right_index=True)
  Col0_x  Col1_x  Col0_y  Col1_y  Col2    Col3    Col4    Col5    Col6    ...
0 1008.5155   n01 0   0   1007.6176   k13 0   k15 k16 ...
1 1008.5155   n01 0   0   1008.6248   k123    0   k25 k26 ...
2 510.0   n03 0   0   510.0103    k03 0   k05 k06 ...

Do the extra calculations

You'll need to rename a few columns, and assign the diff between them, but that should be trivial