matching elements in multi columns from two files, then update or union them

1

votes

file1:

77, 4, -3, A0080
235, 5, -1, K0511

file2:

A0132, 77, -1, -2,  19.776
00000, 77, 4, -3,  18.608,
A0794, 235, -2, -2,  22.81
A0796, 235, -2, -5,  12.27
00000, 235, 5, -1,  18.992

desired output:

A0132, 77, -1, -2,  19.776
A0080, 77, 4, -3,  18.608,
A0794, 235, -2, -2,  22.81
A0796, 235, -2, -5,  12.27
K0511, 235, 5, -1,  18.992

basically to match column1, column2, column3 of file1 into column2, column3, column4 of file2, if match then replace the column1 of file2 by the value of column4 of file1.

I used:

awk 'FNR==NR {a[$1,$2,$3]++;next} a[$2,$3,$4]  {print $0}' file1 file2

to get the output

00000, 77, 4, -3,  18.608,
00000, 235, 5, -1,  18.992

Then I am stuck. Please help. BTW, this is for 2 files, how's about more than 2 files in general.

excelawksed

Try to provide some feedback: did the answer work to you? – fedorqui 'SO stop harming'

0

votes

Apparently there is some problem with trailing spaces. This complicates things a little bit, because you need to do the trick $field+=0 to overcome it (it removes trailing spaces).

You can try with this:

awk -F"," -v OFS=","
    'FNR==NR {$1+=0; $2+=0; $3+=0; a[$1,$2,$3]=$4;next}
     {$2+=0; $3+=0; $4+=0
      if (($2,$3,$4) in a) {$1=a[$2,$3,$4]}
      print
     }' f1 f2

Basically, it stores the value in the 4th column with index (1st, 2nd, 3rd) columns. Then, when reading the second file, it checks if the given index matches the 2nd, 3rd and 4th columns there; if so, it replaces the 1st field.

For your given input, it returns:

$ awk -F"," -v OFS="," 'FNR==NR {$1+=0; $2+=0; $3+=0; a[$1,$2,$3]=$4;next} {$2+=0; $3+=0; $4+=0; if (($2,$3,$4) in a) {$1=a[$2,$3,$4]} print}' f1 f2
A0132,77,-1,-2,  19.776
A0080,77,4,-3,  18.608,
A0794,235,-2,-2,  22.81
A0796,235,-2,-5,  12.27
K0511,235,5,-1,  18.992

0

votes

awk 'FILENAME==ARGV[1]{max++;a1[FNR]=$1;a2[FNR]=$2;a3[FNR]=$3;a4[FNR]=$4;next} {done=0;for (i=0;i<$max;i++) {if ($2==a1[i] && $3==a2[i] && $4=a3[i]) {$1=""; print a4[i]","$0; done=1; break}}; if (done==0){ print}}' file1 file2

Or easier to read:

awk 'FILENAME==ARGV[1]{ ## process file 1  
   max++;               ## keep track of how many entries in file 1 
   a1[FNR]=$1;          ## build separate arrays for each field we care about
   a2[FNR]=$2;
   a3[FNR]=$3;
   a4[FNR]=$4;
   next}                ## go to next file 
  {done=0;              ## set a flag so we know when we have no match 
   for (i=0;i<$max;i++) ## loop over all array entries in file 1  
   {if ($2==a1[i] && $3==a2[i] && $4=a3[i]) ## if columns match in our pairing   
     {$1="";            ## get rid of column 1   
      print a4[i]","$0; ## print out file 1 column 4 & column 2 onward for file 2 
      done=1;           ## set the flag so we know we had a match 
      break}};          ## break for loop, no need to waste time processing more
     if (done==0) {     ## if we did not match, print out the existing file 2 line 
        print}}'    
file1 file2

If you want to extend for more files you can add more clauses setting the ARGV of the filename like this (of course changing the logic to what you want) - also if you want it to be automated and flexible you can build this with a shell loop and use eval to execute it:

 awk 'FILENAME==ARGV[1]{a[FNR]=$0;a1[FNR]=$1;a2[FNR]=$2;a3[FNR]=$3;a4[FNR]=$4;next}
      FILENAME==ARGV[2]{b[FNR]=$0;b2[FNR]=$2;b3[FNR]=$3;b4[FNR]=$4;next}
      FILENAME==ARGV[3]{print "hi" a1[FNR] b2[FNR]}' file1 file2 file3

Update to handle the data results listed in the comments:

awk 'FILENAME==ARGV[1]{max++;a[FNR]=$0;a1[FNR]=$1;a2[FNR]=$2;a3[FNR]=$3;a4[FNR]=$4;$1="";$2="";$3="";$4="";a[FNR]=$0;gsub(",+$","",a[FNR]);next} {done=0;for (i=0;i<$max;i++) {if ($2==a1[i] && $3==a2[i] && $4=a3[i]) {$1=""; gsub(",+$","",$0);gsub(" +","",a[i]);print " "a4[i]$0","a[i]; done=1; break}}; if (done==0){ print}}' file1 file2

Changes that are made are to add the end fields from file 1 and clean up some cosmetics:

awk 'FILENAME==ARGV[1]
## save $0 in new array
{max++;a[FNR]=$0;a1[FNR]=$1;a2[FNR]=$2;a3[FNR]=$3;a4[FNR]=$4; 
## skip the first fields of new array up to field 4 and rid the trailing comma  
$1="";$2="";$3="";$4="";a[FNR]=$0;gsub(",+$","",a[FNR]); next} 
{done=0;for (i=0;i<$max;i++) {if ($2==a1[i] && $3==a2[i] && $4=a3[i]) 
{$1=""; gsub(",+$","",$0);gsub(" +","",a[i]); ## rid unnecessary whitespace 
## print the rest of file 1 line entry 
print " "a4[i]$0","a[i]; done=1; break}}; if (done==0){ print}}' file1 file2

0

votes

This might work for you (GNU sed):

sed -r 's|^(.*,)\s*(.*)|s/^(.*,) \1/\2, \1/|' file1 | sed -rf - file2

Create a sed script from file1 to run against file2.

0

votes

cat file1 file2 \
 | sed -n 'H;${x
:cycle
# \n
#:11
# 77, 4, -3, A0080
#^2222222222 44444
#^33
# A0132, 77, -1, -2,  19.776
#^55555555555555555555555555
# 00000, 77, 4, -3,  18.608,
#^       2222222222
   s/\(\n\)\(\([^,]*,\)\{3\}\) \([A-Z0-9]*\)\(.*\)00000, \2/\1\2 \4\5\4, \2/
   t cycle
:clean
   s/\(\n\)\([^,]*,\)\{3\} [A-Z0-9]*\1/\1/g
   t clean
   s/^\n//
   p
   }'

posix sed (so --posixon GNU sed). line statring with #^ give the grouping index in the line above so 2222222222 is the content of \2 used later in the pattern

load all line in working buffer
find occurence of each triplet [ ($[^,]*,$\{3\}\) in s/// line after :cycle] in later line [ the \2 ] with 00000, as prefix, replace by following 'name' of the triplet [ ``4`]
if found/replace, retry (1 change embrass next triplet so a g will always make only 1 change max) via t cycle that mean if s/// occur, goto the label cycle, if not continue to next line
clean the triplet (replace any line, pattern starting with new line, that have the fingerprint of a triple by just the new line [ \1 in this case ] and remove first new line that was added by using a h on first line that append the current line to the buffer (so first line is just a new line)
print the result

matching elements in multi columns from two files, then update or union them

4 Answers