4
votes

I have multiple files with different number of columns, i need to do a merge on first file and second file and do a left outer join in awk respective to first file and print all columns in both files matching the first column of both files.

I have tried below codes to get close to my output. But i can't print the ",', where no matching number is found in second file. Below is the code. Join needs sorting and takes more time than awk. My file sizes are big, like 30 million records.

awk -F ',' '{  
    if (NR==FNR){ r[$1]=$0}
    else{ if($1 in r) 
    r[$1]=r[$1]gensub($1,"",1)}
}END{for(i in r){print r[i]}}' file1 file2

file1

number,column1,column2,..columnN

File2

numbr,column1,column2,..columnN

Output

number,file1.column1,file1.column2,..file1.columnN,file2.column1,file2.column3...,file2.columnN

file1

1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c

file2

1,x,y
2,x,y
5,x,y
6,x,y
7,x,y

desired output

1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,,
5,a,b,c,x,y
3
Please, show us the code you have tried and the problems you have found.Jdamian
how many rows do you have? I can provide a simple script to parsing but will be terrible slow for big filesm47730

3 Answers

1
votes
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    tail = gensub(/[^,]*,/,"",1)
    if ( FNR == 1 ) {
        empty = gensub(/[^,]/,"","g",tail)
    }
    file2[$1] = tail
    next
}
{ print $0, ($1 in file2 ? file2[$1] : empty) }

$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y

The above uses GNU awk for gensub(), with other awks it's just one more step to do [g]sub() on the appropriate variable after initially assigning it.

An interesting (to me at least!) alternative you might want to test for a performance difference is:

$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    tail = gensub(/[^,]*,/,"",1)
    idx[$1] = NR
    file2[NR] = tail
    if ( FNR == 1 ) {
        file2[""] = gensub(/[^,]/,"","g",tail)
    }
    next
}
{ print $0, file2[idx[$1]] }

$ awk -f tst.awk file2 file1
1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y

but I don't really expect it to be any faster and it MAY even be slower.

0
votes

you can try,

awk 'BEGIN{FS=OFS=","}
   FNR==NR{d[$1]=substr($0,index($0,",")+1); next}
   {print $0, ($1 in d?d[$1]:",")}' file2 file1

you get,

1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y
0
votes

join to the rescue:

$ join -t $',' -a 1 -e '' -o 0,1.2,1.3,1.4,2.2,2.3 file1.txt file2.txt

Explanation:

-t $',': Field separator token.

-a 1: Do not discard records from file 1 if not present in file 2.

-e '': Missing records will be treated as an empty field.

-o: Output format.

file1.txt

1,a,b,c
2,a,b,c
3,a,b,c
5,a,b,c

file2.txt

1,x,y
2,x,y
5,x,y
6,x,y
7,x,y

Output

1,a,b,c,x,y
2,a,b,c,x,y
3,a,b,c,,
5,a,b,c,x,y