
I have two files like below:

file1.txt

2018-03-14 13:23:00 CID [72883359]
2018-03-14 13:23:00 CID [275507537]
2018-03-14 13:23:00 CID [275507539]
2018-03-14 13:23:00 CID [207101094]
2018-03-14 13:23:00 CID [141289821]

and file2.txt

2018-03-14 13:23:00 CID [207101072]
2018-03-14 13:23:00 CID [275507524]
2018-03-14 13:23:00 CID [141289788]
2018-03-14 13:23:00 CID [72883352]
2018-03-14 13:23:01 CID [72883359]
2018-03-14 13:23:00 CID [275507532]

I need to compare the 4th column of the first file with the 4th column of the second file. I am using the command below:

awk 'FNR==NR{a[$4]=$1" "$2" "$3; next} ($4 in a) {print a[$4],$4,$1,$2}' file1.txt file2.txt>file3.txt 

Its output looks like this:

2018-03-14 13:23:00 CID [72883359] 2018-03-14 13:23:01

The above command works properly, but the problem is that file1 and file2 are huge, around 20k lines each, so the command takes a long time.

I want it so that if a match is found, it should skip the remaining column and go for the next one, i.e. some kind of break statement. Please help.

Below is my script.

#!/bin/sh

cron=1;

for((j = $cron; j >= 1; j--))
do
    d1=`date -d "$date1  $j min ago" +%Y-%m-%d`
    d2=`date -d 'tomorrow' '+%Y-%m-%d'`
    
    t1=`date -d "$date1  2 min ago" +%R`
    t2=`date -d "$date1  1 min ago" +%R`
    t3=`date --date="0min" +%R`
done


cat /prd/firewall/logs/lwsg_event.log | egrep "$d1|$d2" | egrep "$t1|$t2|$t3" |  grep 'SRIR' | awk -F ' ' '{print $1,$2,$4,$5}'>file1.txt


cat /prd/firewall/logs/lwsg_event.log | egrep "$d1|$d2" | egrep "$t1|$t2|$t3" | grep 'SRIC' | awk -F ' ' '{print $1,$2,$4,$5}'>file2.txt


awk 'FNR==NR{a[$4]=$1" "$2" "$3; next} ($4 in a) {print a[$4],$4,$1,$2}' file1.txt file2.txt>file3.txt

cat file3.txt | while read LINE
do
    f1=`echo $LINE | cut -f 1 -d " "`
    f2=`echo $LINE | cut -f 2 -d " "`
    
    String1=$f1" "$f2
    
    f3=`echo $LINE | cut -f 5 -d " "`
    f4=`echo $LINE | cut -f 6 -d " "`
    
    String2=$f3" "$f4
    
    
    f5=`echo $LINE | cut -f 3 -d " "`
    f6=`echo $LINE | cut -f 4 -d " "`
    
    String3=$f5" "$f6
    
    StartDate=$(date -u -d "$String1" +"%s")
    FinalDate=$(date -u -d "$String2" +"%s")
    echo "Diff for $String3 :" `date -u -d "0 $FinalDate sec - $StartDate sec" +"%H:%M:%S"` >final_output.txt
done

final_output.txt will be:

Diff for CID [142298410] : 00:00:01
Diff for CID [273089511] : 00:00:00
Diff for CID [273089515] : 00:00:00
Diff for CID [138871787] : 00:00:00
Diff for CID [273089521] : 00:00:00
Diff for CID [208877371] : 00:00:00
Diff for CID [138871793] : 00:00:00
Diff for CID [138871803] : 00:00:00
Diff for CID [273089526] : 00:00:00
Diff for CID [273089545] : 00:00:00
Diff for CID [208877406] : 00:00:02
Diff for CID [208877409] : 00:00:01
Diff for CID [138871826] : 00:00:00
Diff for CID [74659680] : 00:00:00
Could you explain "skip the remaining column and go for next"? It is not clear... Regarding the command you've tried, it looks like the best possible for this case. – Sundeep
Skip means: if a match is found, then there is no need to traverse the whole column of file2. Once a match is found, read the next value from the 4th column of file1 and look for it in the 4th column of file2, so that the script can finish its work faster. – sunil.tanwar
Sorry, I still don't understand... the script is reading each line of both files only once. – Sundeep

3 Answers


Could you please try the following awk and let me know if this helps you.

awk 'FNR==NR{a[$4]=$0;next} ($4 in a){print a[$4],$1,$2}' file1.txt  file2.txt
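If the "break" you are after means that each CID from file1.txt should be matched at most once, a small variant of the above (a sketch, untested on your real data) deletes the key after its first match, so any later duplicates in file2.txt are skipped cheaply:

awk 'FNR==NR{a[$4]=$0; next} ($4 in a){print a[$4], $1, $2; delete a[$4]}' file1.txt file2.txt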

Have you considered the join command? Not many people seem to know about join.

NAME
       join - join lines of two files on a common field

SYNOPSIS
       join [OPTION]... FILE1 FILE2
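For this case it might look something like the following (a sketch, untested; join needs both inputs sorted on the join field, and the <( ) process substitution requires bash rather than plain sh):

join -1 4 -2 4 -o 1.1,1.2,1.3,1.4,2.1,2.2 \
    <(sort -k4,4 file1.txt) <(sort -k4,4 file2.txt) > file3.txt

Note that sorting means the output will no longer be in the original log order.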

Your overall script reads the same file multiple times and contains a large number of other inefficiencies.

Without proper input to test with, it's hard to verify this, but here is a refactoring which should hopefully at least suggest a good direction for further exploration.

#!/bin/bash
# (the for ((...)) arithmetic loop below is a bashism, so this needs bash, not sh)

cron=1;

for((j = $cron; j >= 1; j--))
do
    # Replace obsolescent `backticks` with $(modern command substitution) syntax
    d1=$(date -d "$date1  $j min ago" +%Y-%m-%d)
    d2=$(date -d 'tomorrow' '+%Y-%m-%d')
    
    t1=$(date -d "$date1  2 min ago" +%R)
    t2=$(date -d "$date1  1 min ago" +%R)
    t3=$(date --date="0min" +%R)
done

# Avoid useless cat and useless grep, fold everything into one Awk script
# See also http://www.iki.fi/era/unix/award.html
awk -v d="$d1|$d2" -v t="$t1|$t2|$t3" '
    $0 !~ d { next }
    $0 !~ t { next }
    { o = "" }
    /SRIR/ { o = "file1.txt" }
    /SRIC/ { o = "file2.txt" }
    o { print $1, $2, $4, $5 > o }' /prd/firewall/logs/lwsg_event.log

awk 'FNR==NR{a[$4]=$1" "$2" "$3; next} ($4 in a) {print a[$4],$4,$1,$2}' file1.txt file2.txt>file3.txt

# Avoid uppercase for private variables
# Use read -r always
# Let read split the line
while read -r f1 f2 f5 f6 f3 f4 
do
    String1=$f1" "$f2
    String2=$f3" "$f4
    String3=$f5" "$f6
    
    StartDate=$(date -u -d "$String1" +"%s")
    FinalDate=$(date -u -d "$String2" +"%s")
    echo "Diff for $String3 :" $(date -u -d "0 $FinalDate sec - $StartDate sec" +"%H:%M:%S")
done <file3.txt >final_output.txt

I would guess that the main bottleneck was that you processed the log file multiple times, not so much the small Awk fragment you run on the results, which is what you asked for help with. Note also that redirecting the whole loop with "done <file3.txt >final_output.txt" fixes a bug in your original script: the ">final_output.txt" inside the loop truncated the file on every iteration, so only the last line survived.

This could still probably be refactored into a single Awk script. If you have GNU Awk, you should be able to do the date calculations in Awk, too.
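For instance, here is a rough sketch of the date arithmetic in GNU Awk (untested; it assumes the six-field layout of file3.txt shown above, and mktime/strftime with a UTC flag are gawk extensions):

gawk '{
    d1 = $1; t1 = $2; d2 = $5; t2 = $6
    gsub(/-/, " ", d1); gsub(/:/, " ", t1)    # "2018-03-14" -> "2018 03 14"
    gsub(/-/, " ", d2); gsub(/:/, " ", t2)    # "13:23:01"   -> "13 23 01"
    start = mktime(d1 " " t1)                 # seconds since the epoch
    end = mktime(d2 " " t2)
    # format the difference as HH:MM:SS; the third strftime argument requests UTC
    printf "Diff for %s %s : %s\n", $3, $4, strftime("%H:%M:%S", end - start, 1)
}' file3.txt > final_output.txt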