2
votes

In the following case. I will like to find values which match in: file1: columns 8 & 9 with file2: columns 2 & 3

If the values are exactly the same in both files, then print like the desired output file

file1

31429,36689,313212.5,2334362.5,31429,36679,31308,302412.50 2316512.50
31429,36701,313362.5,2334362.5,31429,36681,31311,2334363,31429
31429,36713,313512.5,2334362.5,31429,36719,31358,303312.50 2316512.50
31429,36749,313962.5,2334362.5,31429,36751,31398,2334362,31429
31429,36809,314712.5,2334362.5,31429,36803,31463,2334361,31429
31429,36821,314862.5,2334362.5,31429,36817,31481,2334363,31429

file2

3000135825 302412.50 2316512.50
3000135837 302562.50 2316512.50
3000135849 302712.50 2316512.50
3000135861 302862.50 2316512.50
3000135873 303012.50 2316512.50
3000135885 303162.50 2316512.50
3000135897 303312.50 2316512.50
3000135909 303462.50 2316512.50
3000135921 303612.50 2316512.50
3000135933 303762.50 2316512.50
3000135945 303912.50 2316512.50

output desired

3000135825 302412.50 2316512.50 3667931308 302412.50 2316512.50
3000135897 303312.50 2316512.50 3671931358 303312.50 2316512.50

I tried Using this command i got the results, BUT it takes a lot time as the file2 have 3 millions of lines and the code take too much time To be able to use the code, first I create a temporary file named tmp1 with columns 5,6,8,9 from file1

awk -F, '{print($5$6,$8,$9)}' file1 > tmp1 

awk 'FNR==NR{a[$2$3]=$0;next}{print $0,a[$2$3]?a[$2$3]:"NA"}' file2 tmp1
4
what is the length of file1? If much less than file2 you can cache file1 contents instead. - karakfa
Hi karakfa, file1 has around 2000 lines only and 9 columns - OXXO

4 Answers

3
votes

If file1 length much less than file2, you can cache file1 contents instead.

something like this (not tested)

$ awk -F, 'NR==FNR      {a[$8,$9]==$6$7; next}   # is $6$7 the key you want to print?
           ($2,$3) in a {print $1,$2,$3,a[$2,$3]}' file1 FS=' ' file2

since the values should match there is no need to print them again. Not sure what is the fourth value printed in the output, but if it's coming from file1, just replace with it.

1
votes

For speed I would consider:

1 - using shell string commands as much as possible

2 - put only the necessary columns in a file

3 - sorting

4 - store files and output in variables - print and file commands take too long inside large loops

1
votes

Could you please try following.

awk 'FNR==NR{a[$8 OFS $9]=$6 $7 OFS $8 OFS $9;next} (($2 OFS $3) in a){print $0,a[$2 OFS $3]}' FS="[, ]"  Input_file1 FS=" " Input_file2

Adding a non-one liner form of solution now.

awk '
FNR==NR{
  a[$8 OFS $9]=$6 $7 OFS $8 OFS $9
  next
}
(($2 OFS $3) in a){
  print $0,a[$2 OFS $3]
}
' FS="[, ]"  Input_file1 FS=" "  Input_file2

Explanation: Adding explanation for above code too now.

awk '
FNR==NR{                              ##Checking condition FNR==NR this will be TRUE when firt Input_file named Input_file1 is being executed.
  a[$8 OFS $9]=$6 $7 OFS $8 OFS $9    ##Creating an array named a whose index is $8 OFS $9 and value if $6 $7 OFS $8 OFS $9.
  next                                ##next keyword is out of the box of awk and will skip further statements from here.
}
(($2 OFS $3) in a){                   ##Statements from here will be executed when 2nd Input_file is being read named Input_file2. Checkingh condition if $2 OFS $3 is present in array a then do following.
  print $0,a[$2 OFS $3]               ##Printing current line along with value of array a whose index is $2 OFS $3.
}                                     ##Closing block for above condition now.
' FS="[, ]" Input_file1 FS=" " Input_file2        ##Setting FS for Input_file1 as comma OR space here then mentioning Input_file1 name then setting FS as space and mentioning Input_file2 name here.
1
votes

Since you are concerned about performance, could you pls try this Perl solution.

$ perl -lne 'BEGIN{@x=map{chomp;@k=split(/[ ,]/,$_);$kv{"$k[-2] $k[-1]"}="$k[-4]$k[-3]"} qx(cat file1.txt)} /(\S+) (\S+)$/ and $kv{$&} and print $_," ",$kv{$&}, " ",$& ' f
ile2.txt
3000135825 302412.50 2316512.50 3667931308 302412.50 2316512.50
3000135897 303312.50 2316512.50 3671931358 303312.50 2316512.50

$