6
votes

Could you please help me with joining two DataFrames.

I have two DataFrames.

df1:

index   val1    val2
--------------------
1       str1    abc1
2       str2    abc2
3       str3    abc3
4       str4    abc9
5       str5    abc4

df2:

index   val2
------------
1       abc1
2       abc2
4       abc3 
5       abc4
9       abc5

I need to create a DataFrame based on the previous two with a left join by two columns. Columns index and val2 have the same names in both DataFrames. The result of df3 should look like this:

index   val1    val2    val3
----------------------------
1       str1    abc1    1
2       str2    abc2    1
3       str3    abc3    NaN
4       str4    abc9    NaN
5       str5    abc4    1

Indexes from df2 that are not present in df1 should be dropped, if an index in df1 has the same val2 as in df2 then 1 should be added to a new column val3, else: NaN.

Huge thanks in advance!

3
is index a column named index or the actual index of the DataFrame?ALollz

3 Answers

5
votes

This is one way. As below, I recommend you use Boolean rather than float for val3, as this what the series represents.

# merge and set index
res = df1.merge(df2, how='left').set_index('index')

# map val2 from df2
res['val3'] = df2.set_index('index')['val2']

# check for equality of val3 and val2
res['val3'] = res['val3'] == res['val2']

print(res)

       val1  val2   val3
index                   
1      str1  abc1   True
2      str2  abc2   True
3      str3  abc3  False
4      str4  abc9  False
5      str5  abc4   True
2
votes

You can try using join (default is left) with rsuffix so that column name for df2 is renamed with suffix. Then, using np.where to check if column value matches and assigning value for val3 column.

import numpy as np

df = df1.join(df2, rsuffix='_df2')
df['val3'] = np.where(df.val2 == df.val2_df2, 1, np.NaN)
del df['val2_df2']
print(df)

Result:

       val1  val2  val3
index                  
1      str1  abc1   1.0
2      str2  abc2   1.0
3      str3  abc3   NaN
4      str4  abc9   NaN
5      str5  abc4   1.0
1
votes

Since you want to merge on a combination of indices and columns you can either add them all to the index, or reset_index before the merge. We'll also assign the val3 column to df2 so it gets merged over.

(df1.reset_index().merge(
     df2.reset_index().assign(val3 = 1), on=['index', 'val2'], how='left')
    .set_index('index'))

Outputs:

       val1  val2  val3
index                  
1      str1  abc1   1.0
2      str2  abc2   1.0
3      str3  abc3   NaN
4      str4  abc9   NaN
5      str5  abc4   1.0

If 'index' is just a column and not the index, it's as simple as specifying two keys to merge on.

df1.merge(df2.assign(val3 = 1), on=['index', 'val2'], how='left')

Outputs:

   index  val1  val2  val3
0      1  str1  abc1   1.0
1      2  str2  abc2   1.0
2      3  str3  abc3   NaN
3      4  str4  abc9   NaN
4      5  str5  abc4   1.0