Overlaying one dataframe with another and keeping only new or changed rows

Question

I have two pandas dataframes that look kind of like the following:

df1:

RecorderID    GroupID    Location    ...    SomeColumn
CT-1000001    BV-        Cape Town          SomeValue
CT-1000002    MP-        Johannesburg       SomeValue
CT-1000003    BV-        Durban             SomeValue

df2:

RecorderID    GroupID    Location    ...    SomeColumn
CT-1000001    BV-        Durban      ...    SomeValue
CT-1000003    BV-        Durban      ...    SomeValue

These two dataframes are large in reality, with many columns and many rows. I want to compare the two dataframes and end with one dataframe accomplishing the following (RecorderID is my primary key):

All rows who's values differ in the two dataframes must adopt df1's values and be kept.
All rows present in df1 but not present in df2 must be inserted.
All values that are contained and the same in both dataframes must be removed.

So, taking the above example, I would end up with the following dataframe:

RecorderID    GroupID    Location    ...    SomeColumn
CT-1000001    BV-        Cape Town          SomeValue
CT-1000002    MP-        Johannesburg       SomeValue

PS: I've noticed when writing out a dataframe to Excel, it inserts an index column as the first column. How do I specify that RecorderID is my primary key and that it should use that to index values? I've tried:

df = read_excel('file.xlsx', 'sheet1', index_col='RecorderID')

but that just removes the RecorderID column and adds a numbered index column anyway when I write it out to excel.

Thanks!

EdChum EdChum · Accepted Answer · 2016-01-26T12:16:54

If you're running a recent version of pandas then you can merge and specify the merge method to be left, additionally we can set indicator=True this adds a column _merge which will you tell if the rows are present in left_only or both, we can then filter those rows out:

In [91]:
merged = pd.merge(df1,df2,indicator=True, how='left' )
merged

Out[91]:
   RecorderID GroupID      Location SomeColumn     _merge
0  CT-1000001     BV-     Cape Town  SomeValue  left_only
1  CT-1000002     MP-  Johannesburg  SomeValue  left_only
2  CT-1000003     BV-        Durban  SomeValue       both

In [92]:
merged[merged['_merge'] == 'left_only']

Out[92]:
   RecorderID GroupID      Location SomeColumn     _merge
0  CT-1000001     BV-     Cape Town  SomeValue  left_only
1  CT-1000002     MP-  Johannesburg  SomeValue  left_only

Overlaying one dataframe with another and keeping only new or changed rows

1 Answers