0
votes

I'm trying to create a dataframe using the values in two columns in a dataframe , this are how they look:

df1

W          F1        F2
0        'AB CD'   'RS TU'
0        'BC EF'   'GH IJ'
1        'BC EF'   'NO PQ'
0        'GH IJ'   'AB CD'
1        'KL MN'   'RS TU'
0        'NO PQ'   'UV WX'
...

df2

  F1         F2        BO_F1         BO_F2
'AB CD'   'GH IJ'       -150          500  
'BC EF'   'TG IP'        265         -150
'BC EF'   'NO PQ'       -500          250
'BC EF'   'GH IJ'        600         -700
'KL MN'   'PP UY'        150         -600 
'RS TU'   'AB CD'       -400          350
...

What I want:

W          F1        F2        BO_F1         BO_F2
0        'AB CD'   'RS TU'      -400          350
0        'BC EF'   'GH IJ'       600         -700
1        'BC EF'   'NO PQ'      -500          250
0        'GH IJ'   'AB CD'      -150          500
1        'KL MN'   'RS TU'      -600          700
0        'NO PQ'   'UV WX'       350         -900

I want to merge the columns BO_F1 and BO_F2 from df2 to df1 based on the values of columns F1 and F2. The main problem I'm having is that some combinations of values are swapped, for example in df1 the first combination in F1 and F2 is AB CD and RS TU but such values in df2 are swapped, this is, F1 and F2 are RS TU and AB CD.

How can I achieve this?

3
you should clarify your question with a minimum working example. simplify your question and write some code that defines some dataframes that replicate your problem. it will make it easier to help - jtorca
please tell me if my solution works - ansev

3 Answers

0
votes

df1.set_index('F1').join(df2.set_index('F2'))

Would this work? I'm not sure if I'm understanding the problem correctly

0
votes

I think if I'm understanding correctly you just care about merging observations based on the elements of the F1 and F2 columns, regardless of their order. Below I sort the F1 and F2 columns in both dataframes (sorting across columns not rows) and then I merge on these new columns.

import pandas as pd
import numpy as np

df1 = pd.DataFrame([
[0,        'AB CD',   'RS TU'],
[0,        'BC EF',   'GH IJ'],
[1,        'BC EF',   'NO PQ'],
[0,        'GH IJ',   'AB CD'],
[1,        'KL MN',   'RS TU'],
[0,        'NO PQ',   'UV WX']],
columns=['W', 'F1', 'F2']
)

df2 = pd.DataFrame([
['AB CD',   'GH IJ', -150, 500],
['BC EF',   'TG IP', 265, -150],
['BC EF',   'NO PQ', -500, 250],
['BC EF',   'GH IJ', 600, -700],
['KL MN',   'PP UY', 150, -600],
['RS TU',   'AB CD', -400, 350]],
columns=['F1', 'F2', 'B0_F1', 'B0_F2']
)

ix1 = pd.DataFrame(np.sort(df1[['F1', 'F2']].values, axis=1),
    columns=['F1', 'F2'])
df1_sorted = pd.concat([df1[[c for c in df1.columns if c not in ['F1', 'F2']]],
                 ix1], axis=1, sort=False)

ix2 = pd.DataFrame(np.sort(df2[['F1', 'F2']].values, axis=1),
    columns=['F1', 'F2'])
df2_sorted = pd.concat([df2[[c for c in df2.columns if c not in ['F1', 'F2']]],
                 ix2], axis=1, sort=False)

df3 = pd.merge(df1_sorted, df2_sorted, on=['F1', 'F2'], how='left')
0
votes

You can use DataFrame.merge independently for each value of W, then you can use pd.concat:

df=df1.reset_index() # to sort values at the end
w_mask=df1['W'].eq(1)
new_df=pd.concat([df[w_mask].merge(df2,left_on=['F1','F2'],right_on=['F1','F2']),
                  df[~w_mask].merge(df2.rename(columns={'F1':'F2','F2':'F1'}),
                   left_on=['F1','F2'],
                   right_on=['F1','F2'])

          ]).sort_values('index').set_index('index')

print(new_df)

Output

       W     F1     F2  BO_F1  BO_F2
index                               
0      0  AB CD  RS TU   -400    350
1      0  BC EF  GH IJ    600   -700
2      1  BC EF  NO PQ   -500    250
3      0  GH IJ  AB CD   -150    500
4      1  KL MN  RS TU   -600    700
5      0  NO PQ  UV WX    350   -900

Sample DataFrame:

    print(df1)
   W     F1     F2
0  0  AB CD  RS TU
1  0  BC EF  GH IJ
2  1  BC EF  NO PQ
3  0  GH IJ  AB CD
4  1  KL MN  RS TU
5  0  NO PQ  UV WX

print(df2)
      F1     F2  BO_F1  BO_F2
0  AB CD  GH IJ   -150    500
1  BC EF  TG IP    265   -150
2  BC EF  NO PQ   -500    250
3  BC EF  GH IJ    600   -700
4  KL MN  PP UY    150   -600
5  RS TU  AB CD   -400    350
6  KL MN  RS TU   -600    700
7  UV WX  NO PQ    350   -900
8  GH IJ  BC EF    600   -700

if there are rows in F1 and F2 of df1 that do not index with df2 and you do not want to lose them use how = 'left' in the DataFrame.merge method, or how = 'outer' if the lines present in df2 that are not in df1 are also.

pd.concat([df[w_mask].merge(df2,left_on=['F1','F2'],right_on=['F1','F2'],how='left'),
           df[~w_mask].merge(df2.rename(columns={'F1':'F2','F2':'F1'}),
                   left_on=['F1','F2'],
                   right_on=['F1','F2'],
                   how='left')

          ]).sort_values('index').set_index('index')