4
votes

I have two pandas data frames. The first is:

df1 = pd.DataFrame({"val1" : ["B2","A1","B2","A1","B2","A1"]})

The second data frame is:

df2 = pd.DataFrame({"val1" : ["A1","A1","A1","B2","B2","B2"],
                    "val2" : [10, 13, 16, 11, 20, 22]})

I would like to merge the two together in a way in which the row ordering from df1 is used and the values from df2 follow this ordering. Ideally, I would like it to look like this:

df_final = pd.DataFrame({"val1" : ["B2","A1","B2","A1","B2","A1"],
                         "val2" : [11, 10, 20, 13, 22, 16]})

I've tried using the merge function with left_on and right_on, but I don't get the output I'm looking for. Any help would be greatly appreciated.

2
Maybe df2.update(df1) and then sort vals?Padraic Cunningham
@PadraicCunningham, the question is how to sort them? ;)MaxU
@MaxU, it is pretty easy to do with regular pythonPadraic Cunningham
@PadraicCunningham, will you get the desired order then?MaxU
I don't think this problem is well-defined, is it? Val1 is not unique. So it seems like the requested result depends on some implicit ordering assumption within A1 and B2 groupings. It would be a good idea to make that assumption explicit here. And I'm skeptical this is really a good overall way to handle data to be honest.JohnE

2 Answers

1
votes

You can do it this way:

  1. sort values in df2 by ['val1', 'val2'], group it by val1 and store it as g2?
  2. add idx column to df1 which will be used in order to pick values from df2

Code:

In [176]: df1['idx'] = 1

In [177]: df1['idx'] = df1.groupby('val1')['idx'].cumsum()-1

In [178]: df1
Out[178]:
  val1  idx
0   B2    0
1   A1    0
2   B2    1
3   A1    1
4   B2    2
5   A1    2

In [179]: g2 = df2.sort_values(['val1', 'val2']).groupby('val1')

In [180]: g2.groups
Out[180]: {'A1': [0, 1, 2], 'B2': [3, 4, 5]}

In [181]: df2.iloc[g2.groups['A1'][1]]
Out[181]:
val1    A1
val2    13
Name: 1, dtype: object

In [182]: df1.apply(lambda x: df2.iloc[g2.groups[x['val1']][x['idx']]], axis=1)
Out[182]:
  val1  val2
0   B2    11
1   A1    10
2   B2    20
3   A1    13
4   B2    22
5   A1    16
0
votes

You could use groupby/cumcount to assign a unique number to each row within each group:

df1['cumcount'] = df1.groupby('val1').cumcount()
#   val1  cumcount
# 0   B2         0
# 1   A1         0
# 2   B2         1
# 3   A1         1
# 4   B2         2
# 5   A1         2

If we do the same to df2:

df2['cumcount'] = df2.groupby('val1').cumcount()
#   val1  val2  cumcount
# 0   A1    10         0
# 1   A1    13         1
# 2   A1    16         2
# 3   B2    11         0
# 4   B2    20         1
# 5   B2    22         2

then merging df1 with df2 on the common columns (val1 and cumcount) produces the desired result:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"val1" : ["B2","A1","B2","A1","B2","A1"]})
df2 = pd.DataFrame({"val1" : ["A1","A1","A1","B2","B2","B2"],
                    "val2" : [10, 13, 16, 11, 20, 22]})
df_final = pd.DataFrame({"val1" : ["B2","A1","B2","A1","B2","A1"],
                         "val2" : [11, 10, 20, 13, 22, 16]})

df1['cumcount'] = df1.groupby('val1').cumcount()
df2['cumcount'] = df2.groupby('val1').cumcount()
result = pd.merge(df1, df2, how='left')
result = result.drop('cumcount', axis=1)
print(result)
assert result.equals(df_final)

yields

  val1  val2
0   B2    11
1   A1    10
2   B2    20
3   A1    13
4   B2    22
5   A1    16

Note that merging with how='left' produces a result with the same number of rows as the first DataFrame, df1, and maintains the same order of rows as df1.