Compare two columns in two csv files in python

Question

I have two csv files with same columns name:

In file1 I got all the people who made a test and all the status (passed/missed)
In file2 I only have those who missed the test

I'd like to compare file1.column1 and file2.column1

If they match then compare file1.column4 and file2.column4
If they are different remove item line from file2

I can't figure how to do that. I looked things with pandas but I didn't manage to do anything that works

What I have is:

file1.csv:

name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
foo;02/11/1989;office;Passed;01/01/2019
bar;03/09/1972;sales;Passed;02/03/2018
Doe;25/03/1958;garage;Missed;02/04/2019
Smith;12/12/2012;compta;Passed;04/05/2019

file2.csv:

name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
Doe;25/03/1958;garage;Missed;02/04/2019

What I want to get is:

file1.csv:

name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
foo;02/11/1989;office;Passed;01/01/2019
bar;03/09/1972;sales;Passed;02/03/2018
Doe;25/03/1958;garage;Missed;02/04/2019
Smith;12/12/2012;compta;Passed;04/05/2019

file2.csv:

name;DOB;service;test status;test date
Doe;25/03/1958;garage;Missed;02/04/2019

Which columns do you mean with 'Column1' and 'Column4'. Can you give names? — Erfan
And why did row 2 in file2 get removed? They are exactly the same in both dataframes. — Erfan
row2 in file2 is removed because Passed (ssucceeded) the test file1 keep everyone status at each time and file2 contains only people who missed the test and have to retry — ricardo

Luiz Fernando Lobo Luiz Fernando Lobo · Accepted Answer · 2019-07-15T16:24:51

So first you will have to open:

    import pandas as pd
    df1 = pd.read_csv('file1.csv',delimiter=';')
    df2 = pd.read_csv('file2.csv',delimiter=';')

Treating the data frame, because of white spaces found

    df1.columns= df1.columns.str.strip()
    df2.columns= df2.columns.str.strip()
    # Assuming only strings
    df1 = df1.apply(lambda column: column.str.strip())
    df2 = df2.apply(lambda column: column.str.strip())

The solution expected, Assuming that your name is UNIQUE.

Merging the files

    new_merged_df = df2.merge(df1[['name','test status']],'left',on=['name'],suffixes=('','file1'))

DataFrame Result:

    name         DOB service test status   test date test statusfile1
0  Smith  12/12/2012  compta      Missed  01/01/2019           Missed
1  Smith  12/12/2012  compta      Missed  01/01/2019           Passed
2    Doe  25/03/1958  garage      Missed  02/04/2019           Missed

Filtering based on the requirements and removing the rows with the name with different test status.

    filter = new_merged_df['test status'] != new_merged_df['test statusfile1']
    # Check if there is different values
    if len(new_merged_df[filter]) > 0:
       drop_names = list(new_merged_df[filter]['name'])
       # Removing the values that we don't want
       new_merged_df = new_merged_df[~new_merged_df['name'].isin(drop_names)]

Removing columns and storing

    # Saving as a file with the same schema as file2
    new_merged_df.drop(columns=['test statusfile1'],inplace=True)
    new_merged_df.to_csv('file2.csv',delimiter=';',index=False)

Result

  name         DOB service test status   test date
2  Doe  25/03/1958  garage      Missed  02/04/2019

Compare two columns in two csv files in python

1 Answers