0
votes

I am a newbie to Unix and shell scripting. I am trying to find differences between 2 .csv files using Unix command. There are conditions on the basis of which I have to find the difference.

  1. Update to an entry i.e. any row in file1 (unique id is employee id) exists in file2 but different value for another column. It is considered as an update. In such case, I need that entry from file2
  2. If an entry exists in file2 and not in the file, It is considered as the addition of new employee. I need that row from file2.
  3. If an entry exists in file1 and not in the file2, It is considered as the deletion of an employee. I need that row from file1.

I am able to find the update and new records using comm -23 sorted_file_2.csv sorted_file_1.csv > updates.csv but not able to find an entry that is deleted.

I have checked solutions using below commands

grep -v -x -f sorted_file_2.csv sorted_file_1.csv > deleted.csv

awk 'NR==FNR{a[$0]=1;next}!a[$0]' sorted_file_2.csv sorted_file_1.csv > deleted.csv

diff sorted_file_1.csv sorted_file_2.csv > deleted.csv

The above commands are always giving me entries that are updated as well as deleted. What I am looking for the only entry from file1 which is not in file2

P.S. The two file can contain all the 3 cases mentioned above. I need output in two csv files. One for update/new records and another one for deleted records.

File1.csv

Row|Employee_ID|Salary|Designation 1|John|2000|Clerk 2|Smith|3000|Supervisor 3|Jenny|1000|Intern 4|Vicky|5000|Manager

File2.csv

Row|Employee_ID|Salary|Designation 1|John|2000|Clerk 2|Smith|4000|Senior Supervisor 4|Vicky|5000|Manager 5|James|5000|Auditor

In the above 2 files Row #2 in file2 is an update, Row#5 is a new entry.Both of them can be combined in single file as

Update_new.csv

2|Smith|4000|Senior Supervisor
5|James|5000|Auditor

Deleted entry is row#3 in file1.csv which is not present in file2.csv to be kept in separate file deleted.csv

3|Jenny|1000|Intern

It is fine even if I am able to add all the two files in single file with one extra coloum specifiyng "UPDATED","NEW","DELETED" value.

2
Why not using diff?oliv
I need the result in csv format only so that I can be read further by java application.Which need to further update, add and delete entries from other applicationRavi.Kumar
Post a sample of both files with the expected output.James Brown

2 Answers

1
votes

You can easily do this with diff itself.

Test files I used

1.csv:

a,1
b,2
d,3
f,5

2.csv:

b,2
c,6
d,3
f,4

This is output of diff

$ diff 1.csv 2.csv
1d0
< a,1
2a2
> c,6
4c4
< f,5
---
> f,4

You can search for the letters 'a', 'd', 'c' to get added, deleted and changed lines. Here is an example of how to get added lines

$ diff 1.csv 2.csv | grep -A1 '^[0-9]*a'
2a2
> c,6

You can properly extract only csv using another sed command

$ diff 1.csv 2.csv | grep -A1 '^[0-9]*a' | sed -n '/^[><]/s/[><] \(.*\)$/\1/p'
c,6

You can easily change the grep and sed commands to format in whatever way you want.

0
votes

Using awk:

$ awk -F\| '
NR==FNR {                       # store first file to a hash
    a[$1]=$0                    # $1 is the key
    next
}
($1 in a) {                     # process second file. If key in a
    if(a[$1]!=$0)               # but data has changed
        print > "updated"       # output to updated file
    delete a[$1]                # delete entry from hash
    next                        # skip to next record in file
}
{
    print > "updated"           # if $1 not in a, it is new, output
}
END {                           # after file 2
    for(i in a)                 # print all leftovers from file 1
        print a[i] > "deleted"  # output to deleted
}' 1.csv 2.csv

Results:

$ cat deleted
3|Jenny|1000|Intern
$ cat updated
2|Smith|4000|Senior Supervisor
5|James|5000|Auditor