1
votes

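I would like to join file1 and file2 with awk, matching the 4th column of file1 against the 1st column of file2, and print the 2nd column from file1 after each line of file2. If there is more than one match (there could be more than 100), print the values separated by commas; if there is no match, print NA.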

FILE1:

alo descrip 1  PAPA
alo descrip 2  LOPA
alo descrip 3  REP
alo descrip 4  SEPO
dlo sapro   31 REP
dlo sapro   35 PAPA

FILE2:

PAPA klob trop
PAPA kopo topo
HOJ  sasa laso
REP  deso rez
SEPO raz  ghul
REP  kok  loko

OUTPUT:

PAPA klob trop descrip,sapro
PAPA kopo topo descrip,sapro
HOJ  sasa laso NA
REP  deso rez  descrip,sapro
SEPO raz  ghul descrip
REP  kok  loko descrip,sapro

I tried:

awk -v FILE_A="FILE1" -v OFS="\t" 'BEGIN { while ( ( getline < FILE_A ) > 0 ) { VAL = $0 ; sub( /^[^ ]+ /, "", VAL ) ; DICT[ $1 ] = VAL } } { print $0, DICT[ $4 ] }' FILE2

but it doesn't work.

2
Based on what you are asking, I think this might be what you need. - Aditya Vartak
And it is not a problem if I have repetition in both files? - Vonton

2 Answers

3
votes

Could you please try the following.

awk '
FNR==NR{
  a[$NF]=(a[$NF]?a[$NF] ",":"")$2
  next
}
{
  printf("%s %s\n",$0,($1 in a)?a[$1]:"NA")
}
'  Input_file1  Input_file2

Explanation: a detailed explanation of the above code.

awk '                                          ##Starting the awk program from here.
FNR==NR{                                       ##Checking condition FNR==NR, which will be TRUE while Input_file1 is being read.
  a[$NF]=(a[$NF]?a[$NF] ",":"")$2              ##Creating array a with index $NF; $2 of the current line keeps getting appended to its value, separated by a comma.
  next                                         ##next skips all further statements for the current line.
}
{
  printf("%s %s\n",$0,($1 in a)?a[$1]:"NA")    ##Printing the current line, then either the value of the array or NA, depending on whether $1 is present in the array.
}
'  Input_file1 Input_file2                     ##Mentioning the Input_file names here.
3
votes

In essence, the question was how to store data in an array when there are duplicated keys. @RavinderSingh13 demonstrated gloriously how to append data to indexed array elements. Another way is to use multidimensional arrays. Here is a sample of how to use them in GNU awk (true arrays of arrays need gawk 4.0 or later):

$ gawk '                                               # using GNU awk
NR==FNR {                                              # process first file
    a[$4][++c[$4]]=$2                                  # 2d array
    next
}
{                                                      # process second file
    printf "%s%s",$0,OFS                               # print the record
    if($1 in a)                                        # if key is found in array
        for(i=1;i<=c[$1];i++)                          # process related dimension
            printf "%s%s",a[$1][i],(i==c[$1]?ORS:",")  # and output elements
    else                                               # if key was not in array
        print "NA"                                     # output NA
}' file1 file2
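
If GNU awk is not available, the same idea can probably be approximated in any POSIX awk with a comma-separated subscript, which joins the parts with SUBSEP into a single key. This is a sketch under that assumption rather than a drop-in replacement; the scratch directory and the out variable exist only for this illustration.

```shell
# Recreate the sample data from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
alo descrip 1  PAPA
alo descrip 2  LOPA
alo descrip 3  REP
alo descrip 4  SEPO
dlo sapro   31 REP
dlo sapro   35 PAPA
EOF
cat > "$tmp/file2" <<'EOF'
PAPA klob trop
PAPA kopo topo
HOJ  sasa laso
REP  deso rez
SEPO raz  ghul
REP  kok  loko
EOF

out=$(awk '
NR==FNR {                                   # process first file
  a[$4, ++c[$4]] = $2                       # composite subscript emulates a[key][n]
  next
}
{                                           # process second file
  printf "%s%s", $0, OFS                    # print the record
  if ($1 in c)                              # c holds the number of values stored per key
    for (i = 1; i <= c[$1]; i++)            # walk the values stored for this key
      printf "%s%s", a[$1, i], (i == c[$1] ? ORS : ",")
  else                                      # key was not seen in the first file
    print "NA"
}' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$out"
rm -rf "$tmp"
```

The counter array c doubles as the membership test here, since the composite keys in a cannot be checked with a plain ($1 in a).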