I'm cross-checking two dataframes to assign a value to a flag. If a specific key is present in both dataframes with a different value, the flag is set to "change" for that row. If the value is the same, the flag is set to "no change". However, if a specific key is present more than once in only one of the two dataframes, the extra rows get the flag "additional change". Let me give an example to make it clearer:
df 1:
| key | value | key value present in df 2 |
|---|---|---|
| abcd | 1 | False |
| wxyz | 5 | True |
df 2:
| key | value | key value present in df 1 |
|---|---|---|
| abcd | 2 | False |
| wxyz | 5 | True |
Then the result for dataframe 1 will be:
df 1:
| key | value | key value present in df 2 | xcheck_flag |
|---|---|---|---|
| abcd | 1 | False | change |
| wxyz | 5 | True | no change |
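To get this result I use the following logic:

    def changeType(df1):
        # Flag each row depending on whether its key/value pair was found in df 2.
        def condition_check(row):
            if not row['key value present in df 2']:
                return 'change'
            else:
                return 'no change'
        # Apply the check row by row (axis=1) and store the result in place.
        df1['xcheck_flag'] = df1.apply(condition_check, axis=1)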
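For the simple case, the same two-way rule can probably also be written without `apply`. This is only a sketch on the first example's data, assuming the boolean column has already been computed; `numpy.where` is swapped in for the row-wise function.

```python
import numpy as np
import pandas as pd

# Example data copied from the first df 1 table above.
df1 = pd.DataFrame({
    "key": ["abcd", "wxyz"],
    "value": [1, 5],
    "key value present in df 2": [False, True],
})

# np.where picks "no change" where the mask is True and "change" elsewhere.
df1["xcheck_flag"] = np.where(
    df1["key value present in df 2"], "no change", "change"
)
```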
Now this is rather straightforward, right? Well, I have a complication which I haven't been able to solve yet.
Imagine the following use case:
df 1:
| key | value | key value present in df 2 |
|---|---|---|
| abcd | 1 | False |
| wxyz | 5 | True |
| abcd | 3 | False |
df 2:
| key | value | key value present in df 1 |
|---|---|---|
| abcd | 2 | False |
| wxyz | 5 | True |
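To make this use case reproducible, the two frames can be built roughly as below. This is a sketch that assumes the boolean column means "this exact key/value pair exists in df 2". Deriving it with a left merge and `indicator=True` is my assumption about how such a column could be computed, not necessarily how it was produced originally.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["abcd", "wxyz", "abcd"], "value": [1, 5, 3]})
df2 = pd.DataFrame({"key": ["abcd", "wxyz"], "value": [2, 5]})

# Left-merge on both columns; the "_merge" indicator is "both" when the
# key/value pair also exists in df2. Dropping duplicates in df2 keeps the
# merged frame the same length and order as df1.
merged = df1.merge(
    df2.drop_duplicates(), on=["key", "value"], how="left", indicator=True
)
df1["key value present in df 2"] = (merged["_merge"] == "both").to_numpy()
```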
In this case, the key abcd appears twice in df 1 and only once in df 2. When this happens, I need to apply the following logic during the cross-dataframe check: the first time the key is matched against dataframe 2, the flag is set to "change" as in the previous case; the second time the key is matched, the flag is set to "additional change". It doesn't matter which row of df 1 gets "change" and which gets "additional change". The only condition is that, in such a case, exactly one key-value pair is assigned "change" and all the other rows with that key are assigned "additional change".
This gives us:
df 1:
| key | value | key value present in df 2 | xcheck_flag |
|---|---|---|---|
| abcd | 1 | False | change |
| wxyz | 5 | True | no change |
| abcd | 3 | False | additional change |
I've been trying to adapt my initial function to include this behaviour but without success.
Any hints would be greatly appreciated!
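For reference, here is a minimal sketch of one way the rule described above might be expressed, using `groupby(...).cumcount()` to number the unmatched rows per key. The helper name `add_xcheck_flag` is hypothetical. It assumes that every key in df 1 also exists in df 2, that df 1 has a unique index, and that only the unmatched rows (flag column `False`) compete for the single "change" slot. None of these assumptions is confirmed by the description above.

```python
import pandas as pd

def add_xcheck_flag(df1: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df1 with an xcheck_flag column."""
    out = df1.copy()
    present = out["key value present in df 2"].astype(bool)

    # Among the unmatched rows, number the occurrences of each key: 0, 1, 2, ...
    occurrence = out[~present].groupby("key").cumcount()

    # Matched rows keep the default; the first unmatched row per key becomes
    # "change" and every later one "additional change".
    out["xcheck_flag"] = "no change"
    out.loc[occurrence[occurrence == 0].index, "xcheck_flag"] = "change"
    out.loc[occurrence[occurrence > 0].index, "xcheck_flag"] = "additional change"
    return out

# Data copied from the second df 1 table above.
df1 = pd.DataFrame({
    "key": ["abcd", "wxyz", "abcd"],
    "value": [1, 5, 3],
    "key value present in df 2": [False, True, False],
})
result = add_xcheck_flag(df1)
```

Because `cumcount` follows row order, the first unmatched row of each key is the one that gets "change", which is fine here since it doesn't matter which row receives which of the two labels.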