0
votes

I have a number of files with the same header:

COL1, COL2, COL3, COL4

You can ignore COL1-COL3. COL4 contains a number. Each file contains about 200 rows. I am trying to sum up across the rows. For example:

File 1

COL1 COL2 COL3 COL4
 x    y   z    3
 a    b   c    4

File 2

COL1 COL2 COL3 COL4
 x     y    z   5 
 a     b    c   10 

Then a new file is returned:

COL1 COL2 COL3 COL4
 x     y    z   8 
 a     b    c   14

Is there a simple way to do this without AWK? I will use AWK if need be, I just thought there might be a simple one-liner that I could just run right away. The AWK script I have in mind feels a bit long.

Thanks

5
Are the COL1-3 same in all the files? Do they appear in the same order in all the files? - choroba
awk would be a good and optimal choice. Of course, not with one short line, but ... with 2 lines, yes, I would write it with 2 lines - RomanPerekhrest
You're thinking backwards - in any given text manipulation situation where there might be a length or complicated solution, awk IS the simpler way. - Ed Morton

5 Answers

1
votes

One more option.

The command:

paste f{1,2}.txt | sed '1d' | awk '{print $1,$2,$3,$4+$8}' | awk 'BEGIN{print "COL1","COL2","COL3","COL4"}1'

The result:

COL1 COL2 COL3 COL4
x y z 8
a b c 14

What it does:

Test files:

$ cat f1.txt
COL1 COL2 COL3 COL4
 x    y   z    3
 a    b   c    4

$ cat f2.txt
COL1 COL2 COL3 COL4
 x     y    z   5
 a     b    c   10

Command: paste f{1,2}.txt
Joins 2 files and gives output:

COL1 COL2 COL3 COL4 COL1 COL2 COL3 COL4
 x    y   z    3     x     y    z   5
 a    b   c    4     a     b    c   10

Command: sed '1d'
Is meant to remove header temporarily

Command: awk '{print $1,$2,$3,$4+$8}'
Returns COL1-3 and sums $4 and $8 from paste result.

Command: awk 'BEGIN{print "COL1","COL2","COL3","COL4"}1'
Adds header back

EDIT:
Following @mklement0 comment, he is right about header handling as I forgot the NR==1 part.

So, I'll proxy his updated version here also:

paste f{1,2}.txt | awk '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'
2
votes

Combining paste with awk, as in Kristo Mägi's answer, is your best bet:

  • paste merges the corresponding lines from the input files,
  • which sends a single stream of input lines to awk, with each input line containing all fields to sum up.

Assuming a fixed number of input files and columns, Kristo's answer can be simplified to (making processing much more efficient):

paste file1 file2 | awk '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'

Note: The above produces space-separated output columns, because awk's default value for OFS, the output-field separator, is a single space.


Assuming that all files have the same column structure and line count, below is a generalization of the solution, which:

  • generalizes to more than 2 input files (and more than 2 data rows)
  • generalizes to any number of fields, as long as the field to sum up is the last one.
#!/bin/bash

files=( file1 file2 ) # array of input files
paste "${files[@]}" | awk -v numFiles=${#files[@]} -v OFS='\t' '
  {
    row = sep = ""
    for(i=1; i < NF/numFiles; ++i) { row = row sep $i; sep = OFS }
    sum = $(NF/numFiles) # last header col. / (1st) data col. to sum
    if (NR > 1) { for(i=2; i<=numFiles; ++i) sum += $(NF/numFiles * i) } # add other cols.
    printf "%s%s%s\n", row, OFS, sum
  }
'

Note that \t (the tab char.) is used to separate output fields and that, due to relying on awk's default line-splitting into fields, preserving the exact input whitespace between fields is not guaranteed.

2
votes

If all files have the same header - awk solution:

awk '!f && FNR==1{ f=1; print $0 }FNR>1{ s[FNR]+=$NF; $NF=""; r[FNR]=$0 }
      END{ for(i=2;i<=FNR;i++) print r[i],s[i] }' File[12]

The output (for 2 files):

COL1 COL2 COL3 COL4
x y z 8
a b c 14

This approach can be applied to multiple files (in that case you may specify globbing File* for filename expansion)

1
votes

You state you have "a number of files". i.e., more than 2.

Given these 3 files (and should work with any number):

$ cat f1 f2 f3
COL1 COL2 COL3 COL4
 x    y   z    3
 a    b   c    4
COL1 COL2 COL3 COL4
 x     y    z   5 
 a     b    c   10 
COL1 COL2 COL3 COL4
 x     y    z   10 
 a     b    c   15 

You can do:

$ awk 'FNR==1{next}
     {sum[$1]+=$4}
     END{print "COL1 COL4"; 
         for (e in sum) print e, sum[e]} ' f1 f2 f3
COL1 COL4
x 18
a 29

It is unclear what you intend to do with COL2 or COL3, so I did not add that.

0
votes
$ awk '
     NR==1 { print }
     { sum[FNR]+=$NF; sub(/[^[:space:]]+[[:space:]]*$/,""); pfx[FNR]=$0 }
     END { for(i=2;i<=FNR;i++) print pfx[i] sum[i] }
' file1 file2
COL1 COL2 COL3 COL4
 x     y    z   8
 a     b    c   14

The above will work robustly and efficiently with any awk on any UNIX system, with any number of input files and with any contents of those files. The only potential problem would be that it has to retain the equivalent of 1 of those files in memory so if each file was absolutely massive then you may exhaust available memory.