Finding differences between 2 csv files using unix command or shell scripting

Question

I am a newbie to Unix and shell scripting. I am trying to find differences between 2 .csv files using Unix command. There are conditions on the basis of which I have to find the difference.

Update to an entry i.e. any row in file1 (unique id is employee id) exists in file2 but different value for another column. It is considered as an update. In such case, I need that entry from file2
If an entry exists in file2 and not in the file, It is considered as the addition of new employee. I need that row from file2.
If an entry exists in file1 and not in the file2, It is considered as the deletion of an employee. I need that row from file1.

I am able to find the update and new records using comm -23 sorted_file_2.csv sorted_file_1.csv > updates.csv but not able to find an entry that is deleted.

I have checked solutions using below commands

grep -v -x -f sorted_file_2.csv sorted_file_1.csv > deleted.csv

awk 'NR==FNR{a[$0]=1;next}!a[$0]' sorted_file_2.csv sorted_file_1.csv > deleted.csv

diff sorted_file_1.csv sorted_file_2.csv > deleted.csv

The above commands are always giving me entries that are updated as well as deleted. What I am looking for the only entry from file1 which is not in file2

P.S. The two file can contain all the 3 cases mentioned above. I need output in two csv files. One for update/new records and another one for deleted records.

File1.csv

Row|Employee_ID|Salary|Designation 1|John|2000|Clerk 2|Smith|3000|Supervisor 3|Jenny|1000|Intern 4|Vicky|5000|Manager

File2.csv

Row|Employee_ID|Salary|Designation 1|John|2000|Clerk 2|Smith|4000|Senior Supervisor 4|Vicky|5000|Manager 5|James|5000|Auditor

In the above 2 files Row #2 in file2 is an update, Row#5 is a new entry.Both of them can be combined in single file as

Update_new.csv

2|Smith|4000|Senior Supervisor
5|James|5000|Auditor

Deleted entry is row#3 in file1.csv which is not present in file2.csv to be kept in separate file deleted.csv

3|Jenny|1000|Intern

It is fine even if I am able to add all the two files in single file with one extra coloum specifiyng "UPDATED","NEW","DELETED" value.

I need the result in csv format only so that I can be read further by java application.Which need to further update, add and delete entries from other application — Ravi.Kumar

udiboy1209 udiboy1209 · Accepted Answer · 2018-06-15T10:47:07

You can easily do this with diff itself.

Test files I used

1.csv:

a,1
b,2
d,3
f,5

2.csv:

b,2
c,6
d,3
f,4

This is output of diff

$ diff 1.csv 2.csv
1d0
< a,1
2a2
> c,6
4c4
< f,5
---
> f,4

You can search for the letters 'a', 'd', 'c' to get added, deleted and changed lines. Here is an example of how to get added lines

$ diff 1.csv 2.csv | grep -A1 '^[0-9]*a'
2a2
> c,6

You can properly extract only csv using another sed command

$ diff 1.csv 2.csv | grep -A1 '^[0-9]*a' | sed -n '/^[><]/s/[><] \(.*\)$/\1/p'
c,6

You can easily change the grep and sed commands to format in whatever way you want.

Finding differences between 2 csv files using unix command or shell scripting

2 Answers