I've got two csv files with data and I want to match them, and add a column which contains a flag value when they are matched. The first file is the "master" file with all the patients, and the second file contains the documents ids that have been signed.
Contents of file 1:
recid visit docid
1 arm1 1012
2 arm1 1023
3 arm1 1024
4 arm1 1026
5 arm1 1028
6 arm1 1031
7 arm1 1037
8 arm1 1040
Contents of file 2:
docid
1023
1024
1028
Desired output file:
recid visit docid match
1 arm1 1012
2 arm1 1023 1
3 arm1 1024 1
4 arm1 1026
5 arm1 1028 1
6 arm1 1031
7 arm1 1037
8 arm1 1040
Here's the code I've got so far:
import pandas as pd
# which file to read
IN_FILE1 = "patients.txt"
IN_FILE2 = "informedconsent.txt"
OUT_FILE = "output_matched.csv"
# load data from csv files
df_file1 = pd.read_csv (IN_FILE1, sep = "\t")
df_file2 = pd.read_csv (IN_FILE2, sep = "\t")
# merge on docid
df_merge = df_file1.merge(df_file2, on='docid', how='left')
# flag the matches
if df_merge['docid'].empty:
df_merge['matched'] = ""
else:
df_merge['matched'] = "1"
print(df_merge) #test
# write to file
df_merge.to_csv (OUT_FILE, sep = "\t", index = False, header=True)
The problem is obviously that it is merged on the docid
column so it's never empty. But the second file contains only one column, so no new columns are added, else I could check those new columns or something.
If I change it to how='inner'
then it works but then I get only the matches, but the result needs to be all the records of the first file, even when they don't match.
When I edit file2 and change the column name to docid2
and change the code to this:
df_merge = df_file1.merge(df_file2, left_on='docid', right_on='docid2', how='left')
if df_merge['docid2'].empty:
# etc..
It adds a docid2
column which only contains a guid value when there is a match, so that is correct. However the matched
column still always contains "1".
Any ideas as to how to do this?