awk to print Missing sequence gap and min-max values:

Question

I would like to print the gaps in the sequence numbers in the first column (the start and end of each missing range), together with the minimum and maximum sequence number of that column, for each combination of the fields $2, substr($3,4,6), substr($4,4,6), $6, $8 and $10. The input file is not sorted by the first column.

Input.csv

21,abc,22-JUN-12.08:06:03,22-JUN-12.08:06:03,19-Apr-16,1,INR,RO0412,RC03,L7,,31
22,abc,22-JUN-12.08:06:03,22-JUN-12.08:06:03,19-Apr-16,1,INR,RO0412,RC03,L7,,31
23,abc,22-JUN-12.08:06:03,22-JUN-12.08:06:03,19-Apr-16,1,INR,RO0412,RC03,L7,,31
24,abc,30-JUN-12.01:06:49,30-JUN-12.01:06:49,19-Apr-16,1,INR,RO0412,RC03,L7,,29
28,abc,30-JUN-12.01:06:49,30-JUN-12.01:06:49,19-Apr-16,1,INR,RO0412,RC03,L7,,29
32,abc,29-MAY-13.12:05:11,29-MAY-13.12:05:11,15-Feb-17,1350,INR,RO0213,CD,K1,,30
38,abc,29-MAY-13.12:05:11,29-MAY-13.12:05:11,15-Feb-17,1350,INR,RO0213,CD,K1,,30
41,abc,20-FEB-14.11:02:37,20-FEB-14.11:02:37,31-Dec-20,650,INR,EN1113,ch650,S317,,28
46,abc,20-FEB-14.11:02:37,20-FEB-14.11:02:37,31-Dec-20,650,INR,EN1113,ch650,S317,,28
51,abc,20-FEB-14.11:02:37,20-FEB-14.11:02:37,31-Dec-20,650,INR,EN1113,ch650,S317,,28
52,abc,20-FEB-14.11:02:37,20-FEB-14.11:02:37,31-Dec-20,650,INR,EN1113,ch650,S317,,28

I have tried this command, but it gives only part of the output I need:

cat Input.csv | \
awk -F, '{OFS=","; print $1,$2,substr($3,4,6),substr($4,4,6),$6,$8,$10}' | \
sort -k1 -t, | \
awk -F, 'BEGIN {OFS=","} (($1!=p+1) && ($7==p7)) {print p,p2,p3,p4,p5,p6,p7,p+1 "," $1-1,$1} {p=$1;p2=$2;p3=$3;p4=$4;p5=$5;p6=$6;p7=$7}'

The columns in the output of the above command are:

Minimum Seq ($1),$2,substr($3,4,6),substr($4,4,6),$6,$8,$10,start Missing Seq ($1),End Missing Seq ($1),Maximum Seq ($1)

24,abc,JUN-12,JUN-12,1,RO0412,L7,25,27,28
32,abc,MAY-13,MAY-13,1350,RO0213,K1,33,37,38
41,abc,FEB-14,FEB-14,650,EN1113,S317,42,45,46
46,abc,FEB-14,FEB-14,650,EN1113,S317,47,50,51

In the above output, the Minimum Seq ($1) and Maximum Seq ($1) values are not what I expected. For instance, in the first line of the output the minimum sequence should be 21, not 24, and in the third line the maximum sequence should be 52, not 46. Please help.

Desired Output:

$2,substr($3,4,6),substr($4,4,6),$6,$8,$10,Start Missing Seq ($1),End Missing Seq ($1),Minimum Seq ($1),Maximum Seq ($1)

abc,JUN-12,JUN-12,1,RO0412,L7,25,27,21,28
abc,MAY-13,MAY-13,1350,RO0213,K1,33,37,32,38
abc,FEB-14,FEB-14,650,EN1113,S317,42,45,41,52
abc,FEB-14,FEB-14,650,EN1113,S317,47,50,41,52

Try to use the edit button and format the question a little bit. Like this it is impossible to read. — fedorqui 'SO stop harming'
Håkon, thanks a lot for this lengthy script and your efforts. While running it I am getting this error: -bash-3.2$ perl Min_Max_MissingGap.pl Can't locate File/Slurp.pm in @INC (@INC contains: /usr/lib64/perl5/site_perl/5.8.8/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.7/x86_64-linux-thread-multi /usr/lib64/perl5/site_perl/5.8.6/x86_64-linux-thread-multi multi /usr/lib64/perl5/vendor_perl/5.8.5/ multi /usr/lib/perl5/5.8.8 .) at Min_Max_MissingGap.pl line 5. BEGIN failed--compilation aborted at Min_Max_MissingGap.pl line 5 — VNA

Håkon Hægland · Accepted Answer · 2014-05-15T17:01:43

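You can try the following Perl script (it uses the File::Slurp module, which is not part of core Perl and has to be installed separately):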

#! /usr/bin/perl

use warnings;
use strict;
use File::Slurp qw(read_file);
use List::Util qw(min max);

# Read all lines of the input file (named Input.csv in the question).
my @lines=read_file('Input.csv');

my $ll=sortLines(\@lines);

$ll=reduceFields($ll);

my $rr=findRanges($ll);

printMissingSeqs($rr,$ll);


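# Walk the sorted, reduced lines and print one row per gap: the six data
# fields, the start and end of the missing range, then min and max for the key.
sub printMissingSeqs {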
  my ($rr,$ll) = @_;

  my $pkey=""; my $pss; my $i=0; 
  for (@$ll) {
     my @f=split(/,/);
     my $key=$f[6];
     my $ss=$f[0];
     $pss=$ss if $i==0;
     if (($key eq $pkey) && ($ss-$pss)>1) {
        print join(",",(@f[1..6], $pss+1,$ss-1,@{$rr->{$key}}))."\n";
     }
     $pkey=$key; $pss=$ss;
     $i++;
  }
}

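# Collect the sequence numbers per key (the last reduced field, originally $10)
# and return a hash ref mapping key => [min, max].
sub findRanges {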
  my ($ll) = @_;

  my %temp;
  my %rr;

  for (@$ll) {
     my @f=split(/,/);
     push (@{$temp{$f[6]}},$f[0]);
  }

  for (keys %temp) {
     my $min=min(@{$temp{$_}});
     my $max=max(@{$temp{$_}});
     $rr{$_}=[$min, $max];
  }
  return \%rr;
}

# Keep only $1, $2, substr($3,4,6), substr($4,4,6), $6, $8 and $10 of each line.
sub reduceFields {
  my ($ll) = @_;

  my @a;
  for (@$ll) {
     my @f=split(/,/);
     my $line=join(",",($f[0],$f[1],substr($f[2],3,6),substr($f[3],3,6),$f[5],$f[7],$f[9]));
     push (@a,$line);
  }
  return \@a;
}


# Sort the lines numerically on the first comma-separated field.
sub sortLines {
  my ($lines) = @_;

  my @a=sort { my ($keyA)=$a=~/(.*?),/; my ($keyB)=$b=~/(.*?),/; $keyA<=>$keyB} @$lines;

  return \@a;
}

Output:

abc,JUN-12,JUN-12,1,RO0412,L7,25,27,21,28
abc,MAY-13,MAY-13,1350,RO0213,K1,33,37,32,38
abc,FEB-14,FEB-14,650,EN1113,S317,42,45,41,52
abc,FEB-14,FEB-14,650,EN1113,S317,47,50,41,52
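
Since the question asked for awk, here is a rough sketch of the same idea using sort and awk only. It is not a translation of the Perl script, just one way it could be done. It assumes, as the Perl script does, that rows are grouped by $10 alone. The function name missing_gaps is made up for illustration. Because the input is sorted numerically first, the first sequence seen for a key is its minimum and the last one is its maximum. The gap rows are held back until END, when both values are known.

```shell
# Hypothetical helper: print gap rows with per-key min/max for a CSV file.
missing_gaps() {
    # Numeric sort on column 1 so gaps show up between neighbouring rows.
    sort -t, -k1,1n "$1" |
    awk -F, -v OFS=, '
    {
        key = $10                          # grouping field (assumption: $10 only)
        if (!(key in min)) min[key] = $1   # first row per key is the minimum
        max[key] = $1                      # last row per key is the maximum

        # Same key as the previous row and a jump larger than 1: record a gap.
        if (NR > 1 && key == pkey && $1 - pseq > 1) {
            gstart = pseq + 1
            gend = $1 - 1
            n++
            gkey[n] = key
            grow[n] = $2 OFS substr($3, 4, 6) OFS substr($4, 4, 6) OFS $6 OFS $8 OFS $10 OFS gstart OFS gend
        }
        pkey = key
        pseq = $1
    }
    END {
        # Min and max are complete only now, so the stored rows are printed here.
        for (i = 1; i <= n; i++)
            print grow[i], min[gkey[i]], max[gkey[i]]
    }'
}
```

Calling missing_gaps Input.csv on the sample data from the question should print the same four lines as the Perl output above.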
