0
votes

I have 2 files file1.csv and file2.csv

file1.csv contains only 1 column with hundreds of rows.

aaa
ddd
fff
ggg

file2.csv contains 5 fields with thousands of rows.

aaa,2,3,4,
aaa,2,3,4, 
bbb,2,3,4,
ccc,2,3,4, 
ccc,2,3,4, 
ddd,2,3,4, 
ddd,2,3,4,
ddd,2,3,4,  
eee,2,3,4, 
fff,2,3,4, 
ggg,2,3,4, 
hhh,2,3,4, 
hhh,2,3,4,   

My task is to check if col1 present in file1.csv matches with col1 in fil2.csv then change the 5th column to Y in file2.csv

Desired output

aaa,2,3,4,Y
aaa,2,3,4,Y 
bbb,2,3,4, 
ccc,2,3,4, 
ccc,2,3,4, 
ddd,2,3,4,Y
ddd,2,3,4,Y
ddd,2,3,4,Y  
eee,2,3,4, 
fff,2,3,4,Y 
ggg,2,3,4,Y 
hhh,2,3,4, 
hhh,2,3,4, 

what i tried is

for i in $(cat file1.csv); do awk -F "," '$1==$i{$5="Y"}1' OFS="," file2.csv ; done

But I am getting only the matched records but not the unmatched records.

Is there a better way I can achieve this in UNIX using awk,sed or other common utility.

EDIT: Update question with clear example

2

2 Answers

1
votes

No need to do it like that, just awk reads two files is okay:

awk -F, 'NR==FNR{a[$1]++;next;}a[$1]{$5="Y"}1' file1.csv file2.csv

Unsure if the title and the 2nd line is your way of explaining, if you want to get rid of them:

awk -F, 'NR==FNR{a[$1]++;next;}a[$1]{$5="Y"}FNR>2' file1.csv file2.csv

Note if file1.csv could be empty, you should change NR==FNR to different file checking methods, like ARGIND==1 for GNU awk, or FILENAME=="file1.csv" etc.

If there're huge data to dealing with, change a[$1]++ to a[$1]=1 will slightly improve speed.
Also if you want to keep title (or the 2nd line), then it's better to start change array a when FNR>1 or FNR>2. Improve the command yourself, I am sure you got the idea ;)

NR==FNR means first file, because NR means Number of total Records now, and FNR means current File's Number of Records.
a is an array to actually save $1 as keys.
next skip other block from executing.

When NR==FNR is false, means it's not first file, first block will not be executed.
(You can also use NR>FNR before a block to specify it, but since I used next in NR==FNR block, so it is not necessary.)
Then other directives kick in, a[$1] is to judge the key exists in the array a or not (by refer the value, literally check key exists actually should be $1 in a), if it exists, then change $5.
Last 1 is to specify a true value, it's a shortcut for {print}.
(Since an expression without a block, will have {print} implied, and a block is executed when the afore expression calculated to true, for which 1 is always true.)

0
votes

You can try the Perl solution

$ perl -F, -lane 'BEGIN {%kv=map{chomp;$_=>1} qx(cat file1.csv) } print "$_", $kv{$F[0]}? "Y" : "" ' file2.csv
aaa,2,3,4,Y
aaa,2,3,4,Y
bbb,2,3,4,
ccc,2,3,4,
ccc,2,3,4,
ddd,2,3,4,Y
ddd,2,3,4,Y
ddd,2,3,4,Y
eee,2,3,4,
fff,2,3,4,Y
ggg,2,3,4,Y
hhh,2,3,4,
hhh,2,3,4,

$ cat file1.csv
aaa
ddd
fff
ggg

$ cat file2.csv
aaa,2,3,4,
aaa,2,3,4,
bbb,2,3,4,
ccc,2,3,4,
ccc,2,3,4,
ddd,2,3,4,
ddd,2,3,4,
ddd,2,3,4,
eee,2,3,4,
fff,2,3,4,
ggg,2,3,4,
hhh,2,3,4,
hhh,2,3,4,

$