I have two pandas dataframes that look kind of like the following:
df1:
RecorderID GroupID Location ... SomeColumn
CT-1000001 BV- Cape Town ... SomeValue
CT-1000002 MP- Johannesburg ... SomeValue
CT-1000003 BV- Durban ... SomeValue
df2:
RecorderID GroupID Location ... SomeColumn
CT-1000001 BV- Durban ... SomeValue
CT-1000003 BV- Durban ... SomeValue
These two dataframes are large in reality, with many columns and many rows. I want to compare them and end up with one dataframe that accomplishes the following (RecorderID is my primary key):
- All rows whose values differ between the two dataframes must adopt df1's values and be kept.
- All rows present in df1 but not in df2 must be inserted.
- All rows that are identical in both dataframes must be removed.
So, taking the above example, I would end up with the following dataframe:
RecorderID GroupID Location ... SomeColumn
CT-1000001 BV- Cape Town ... SomeValue
CT-1000002 MP- Johannesburg ... SomeValue
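In case it helps, here's the sample data as code, together with the approach I've been considering: merging on all columns with an indicator, then keeping only the rows unique to df1. This is just my sketch (column names are simplified from the real data), so I'm not sure it's the right way:

```python
import pandas as pd

df1 = pd.DataFrame({
    'RecorderID': ['CT-1000001', 'CT-1000002', 'CT-1000003'],
    'GroupID':    ['BV-', 'MP-', 'BV-'],
    'Location':   ['Cape Town', 'Johannesburg', 'Durban'],
    'SomeColumn': ['SomeValue', 'SomeValue', 'SomeValue'],
})
df2 = pd.DataFrame({
    'RecorderID': ['CT-1000001', 'CT-1000003'],
    'GroupID':    ['BV-', 'BV-'],
    'Location':   ['Durban', 'Durban'],
    'SomeColumn': ['SomeValue', 'SomeValue'],
})

# A left merge on every common column marks each df1 row as 'both'
# (identical row exists in df2) or 'left_only' (changed or new).
# Keeping only 'left_only' drops the rows that are the same in both,
# while changed rows keep df1's values and new rows are kept too.
merged = df1.merge(df2, how='left', indicator=True)
result = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
```

On this sample, result contains CT-1000001 (with df1's Cape Town value) and CT-1000002, and the identical CT-1000003 row is gone, which matches the output I'm after.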
PS: I've noticed that when writing a dataframe out to Excel, pandas inserts an index column as the first column. How do I specify that RecorderID is my primary key and that it should be used as the index? I've tried:
df = pd.read_excel('file.xlsx', 'sheet1', index_col='RecorderID')
but that just removes the RecorderID column and adds a numbered index column anyway when I write it back out to Excel.
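For what it's worth, here's a small sketch of what I understand about the index (made-up sample values; the to_excel call is commented out since it needs an Excel writer like openpyxl installed):

```python
import pandas as pd

# Hypothetical sample standing in for the real spreadsheet data.
df = pd.DataFrame({
    'RecorderID': ['CT-1000001', 'CT-1000002'],
    'GroupID':    ['BV-', 'MP-'],
})

# Make RecorderID the index, so it is no longer an ordinary column.
df = df.set_index('RecorderID')

# to_excel writes the index by default (index=True), so RecorderID
# should come back as the first column, labelled with the index name.
# df.to_excel('file.xlsx')
```

My understanding is that if the index is set correctly, to_excel writes it as the first column under its name rather than an unnamed numbered column, but I'd appreciate confirmation.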
Thanks!