1
votes

I've got two csv files with data and I want to match them, and add a column which contains a flag value when they are matched. The first file is the "master" file with all the patients, and the second file contains the documents ids that have been signed.

Contents of file 1:

recid  visit   docid
1      arm1    1012
2      arm1    1023
3      arm1    1024
4      arm1    1026
5      arm1    1028
6      arm1    1031
7      arm1    1037
8      arm1    1040

Contents of file 2:

docid
1023
1024
1028

Desired output file:

recid  visit   docid  match
1      arm1    1012   
2      arm1    1023   1
3      arm1    1024   1
4      arm1    1026
5      arm1    1028   1
6      arm1    1031
7      arm1    1037
8      arm1    1040

Here's the code I've got so far:

import pandas as pd

# which file to read
IN_FILE1 = "patients.txt"
IN_FILE2 = "informedconsent.txt"
OUT_FILE = "output_matched.csv"
 
# load data from csv files
df_file1 = pd.read_csv (IN_FILE1, sep = "\t")
df_file2 = pd.read_csv (IN_FILE2, sep = "\t")

# merge on docid
df_merge = df_file1.merge(df_file2, on='docid', how='left')

# flag the matches
if df_merge['docid'].empty:
    df_merge['matched'] = ""
else: 
    df_merge['matched'] = "1"

print(df_merge) #test

# write to file
df_merge.to_csv (OUT_FILE, sep = "\t", index = False, header=True)

The problem is obviously that it is merged on the docid column so it's never empty. But the second file contains only one column, so no new columns are added, else I could check those new columns or something.

If I change it to how='inner' then it works but then I get only the matches, but the result needs to be all the records of the first file, even when they don't match.

When I edit file2 and change the column name to docid2 and change the code to this:

df_merge = df_file1.merge(df_file2, left_on='docid', right_on='docid2', how='left')

if df_merge['docid2'].empty:
    # etc..

It adds a docid2 column which only contains a guid value when there is a match, so that is correct. However the matched column still always contains "1".

Any ideas as to how to do this?

1

1 Answers

0
votes

Merge with indicator:

df3 = df1.merge(df2, on='docid', how='left', indicator='match')

df3:

   recid visit  docid      match
0      1  arm1   1012  left_only
1      2  arm1   1023       both
2      3  arm1   1024       both
3      4  arm1   1026  left_only
4      5  arm1   1028       both
5      6  arm1   1031  left_only
6      7  arm1   1037  left_only
7      8  arm1   1040  left_only

Then map to convert values:

df3['match'] = df3['match'].map({'both': 1, 'left_only': ''})

df3:

   recid visit  docid match
0      1  arm1   1012      
1      2  arm1   1023     1
2      3  arm1   1024     1
3      4  arm1   1026      
4      5  arm1   1028     1
5      6  arm1   1031      
6      7  arm1   1037      
7      8  arm1   1040