Merge CSV files using join/awk/sed

2

votes

Could you please help me to find THE bash command which will join/merge those following cvs files "template.csv + file1.csv + file2.csv + file3.csv + ... + fileX.csv" into "ouput.csv".

For each line in template.csv, concatenate associated values (if exist) listed in the fileX.csv as below:

template.csv:

header
1
2
3
4
5
6
7
8
9

file1.csv:

header,value1
2,value12
3,value13
7,value17
8,value18
9,value19

file2.csv:

header,value2
1,value21
2,value22
3,value23
4,value24

file3.csv:

header,value3
2,value32
4,value34
6,value36
7,value37
8,value38

output.csv:

header,value1,value2,value3
1,,value21,
2,value12,value22,value32
3,value13,value23,
4,,value24,value34
5,,,
6,,,value36
7,value17,,value37
8,value18,,value38
9,value19,,

My template file is containing 35137 lines.
I already developed a bash script doing this merge (based on "do while", etc...) but the performance is not good at all. Too long to make the output.csv. I'm sure that it is possible to do the same using join, awk, ... but I don't see how ...

IMPORTANT UPDATE

The first column of my real files are containing a datetime and not a simple number ... so the script must take into account the space between the date and the time ... sorry for the update !

Script should be now designed with the below csv files as example:

template.csv:

header
2000-01-01 00:00:00
2000-01-01 00:15:00
2000-01-01 00:30:00
2000-01-01 00:45:00
2000-01-01 01:00:00
2000-01-01 01:15:00
2000-01-01 01:30:00
2000-01-01 01:45:00
2000-01-01 02:00:00

file1.csv:

header,value1
2000-01-01 00:15:00,value12
2000-01-01 00:30:00,value13
2000-01-01 01:30:00,value17
2000-01-01 01:45:00,value18
2000-01-01 02:00:00,value19

file2.csv:

header,value2
2000-01-01 00:00:00,value21
2000-01-01 00:15:00,value22
2000-01-01 00:30:00,value23
2000-01-01 00:45:00,value24

file3.csv:

header,value3
2000-01-01 00:15:00,value32
2000-01-01 00:45:00,value34
2000-01-01 01:15:00,value36
2000-01-01 01:30:00,value37
2000-01-01 01:45:00,value38

output.csv:

header,value1,value2,value3
2000-01-01 00:00:00,,value21,
2000-01-01 00:15:00,value12,value22,value32
2000-01-01 00:30:00,value13,value23,
2000-01-01 00:45:00,,value24,value34
2000-01-01 01:00:00,,,
2000-01-01 01:15:00,,,value36
2000-01-01 01:30:00,value17,,value37
2000-01-01 01:45:00,value18,,value38
2000-01-01 02:00:00,value19,,

bashcsvjoinawksed

4 Answers

3

votes

$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR { key[++numRows] = $1 }
{ fld[$1,ARGIND] = $NF }
END {
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        for (colNr=1; colNr<=ARGIND; colNr++) {
            printf "%s%s", fld[key[rowNr],colNr], (colNr<ARGIND ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk template.csv file1.csv file2.csv file3.csv
header,value1,value2,value3
2000-01-01 00:00:00,,value21,
2000-01-01 00:15:00,value12,value22,value32
2000-01-01 00:30:00,value13,value23,
2000-01-01 00:45:00,,value24,value34
2000-01-01 01:00:00,,,
2000-01-01 01:15:00,,,value36
2000-01-01 01:30:00,value17,,value37
2000-01-01 01:45:00,value18,,value38
2000-01-01 02:00:00,value19,,

The above uses GNU awk for ARGIND, with other awks just add a line that says FNR==1 { ++ARGIND }.

1

votes

You could use multiple calls to join :

join -t , -a 1 -o auto template.csv file1.csv | join -t , -a 1 -o auto - file2.csv | join -t , -a 1 -o auto - file3.csv

Or more clearer :

alias myjoin='join -t , -a 1 -o auto'
myjoin template.csv file1.csv | myjoin - file2.csv | myjoin - file3.csv

Explanation :

-t , specifies the field separator (,)
-a 1 instructs to print unpairable lines coming from the first file (an assumption is made that the header file contains all possible headers)
-o auto controls formatting and is necessary to print the empty fields

Proof :

$ join -t , -a 1 -o auto template.csv file1.csv | join -t , -a 1 -o auto - file2.csv | join -t , -a 1 -o auto - file3.csv
header,value1,value2,value3
2000-01-01 00:00:00,,value21,
2000-01-01 00:15:00,value12,value22,value32
2000-01-01 00:30:00,value13,value23,
2000-01-01 00:45:00,,value24,value34
2000-01-01 01:00:00,,,
2000-01-01 01:15:00,,,value36
2000-01-01 01:30:00,value17,,value37
2000-01-01 01:45:00,value18,,value38
2000-01-01 02:00:00,value19,,

Note :

For this to work, the files MUST be sorted on the join fields (the header in your case). You can use the sort command if this is not the case.

1

votes

This should work (for explanation read the comments):

#!/bin/sh

awk -F, -v file=0 '
  FNR == 1 {                     # first line in the file
    if(file == 0) {              # if in first file (template.csv):
      header = $1                # init header
    } else {
      header = header "," $2     # else append field name
    }
    next                         # forward to next line.
  }
  file == 0 {                    # if in first file:
    key[FNR] = $1                # remember key
    next                         # next line.
  }
  {
    field[$1][file] = $2         # otherwise: remember field
  }
  ENDFILE {                      # at the end of a file:
    file = file + 1              # increase counter
  }
  END {                          # in the end, assemble and
    print header                 # print lines.
    asort(key)
    for(k in key) {
      line = ""
      for(i = 1; i < file; ++i) {
        line = line "," field[key[k]][i]
      }
      print key[k] line
    }
  }
  ' template.csv file1.csv file2.csv file3.csv

0

votes

I would go with this, however it surely is not the fastest running solution, but for your data it returns correct result and code is short:

#!/bin/bash
CONTENT=$(cat template.scv)
for line in $CONTENT; do
    TMP=$(echo $line)
    for file in file1.csv file2.csv file3.csv; do
        RESULT=$(grep "^$line," $file | cut -d',' -f2)
        TMP=$(echo $TMP,$RESULT)
    done
    echo $TMP
done

output:

header,value1,value2,value3
1,,value21,
2,value12,value22,value32
3,value13,value23,
4,,value24,value34
5,,,
6,,,value36
7,value17,,value37
8,value18,,value38
9,value19,,

EDIT: my code was missing a comma (,), so for longer ids it did not work properly EDIT 2: Well it is not "not the fastest solution", it is really slow one