0
votes

I have a csv file which I'll be using as input with a format looking like this:

xValue,value1-avg,value1-median,value2-avg,value3-avg,value3-median
1,3,4,20,14,20

The key attributes of the input file are that each "value" will have a variable number of statistics, but the statistic type and "value" will always be separated by a "-". I then want to output the statistics of all the "values" to separate csv files.

The output would then look something like this:

value1.csv

xvalue,value1-avg,value1-median
1,3,4

value2.csv

xvalue,value2-avg
1,20

I've tried finding solutions to this, but all I can find are ways to copy by the column number, not the header name. I need to be able to use the header names to append the associated statistics to each of the output csv files.

Any help is greatly appreciated!

P.S. the output file may have already been written to during previous runs of this script, meaning the code should append to the output file

3

3 Answers

2
votes

Untested but should be close:

awk -F, '
NR==1 {
    for (i=2;i<=NF;i++) {
        outfile = $i
        sub(/-.*/,".csv",outfile)
        outfiles[i] = outfile
    }
}
{
    delete(outstr)
    for (i=2;i<=NF;i++) {
        outfile = outfiles[i]
        outstr[outfile] = outstr[outfile] FS $i
    }
    for (outfile in outstr)
        print $1 outstr[outfile] >> outfile
}
' inFile.csv

Note that deleting a whole array with delete(outstr) is gawk-specific. With other awks you can use split("",outstr) to get the same effect.

Note that this appends the output you wanted to existing files BUT that means you'll get the header line repeated on every execution. If that's an issue, tell us how to know when to generate the header line or not but the solution I THINK you'll want would look something like this:

awk -F, '
NR==1 {
    for (i=2;i<=NF;i++) {
        outfile = $i
        sub(/-.*/,".csv",outfile)
        outfiles[i] = outfile
    }
    for (outfile in outfiles) {
        exists[outfile] = ( ((getline tmp < outfile) > 0) && (tmp != "") )
        close(outfile)
    }
}
{
    delete(outstr)
    for (i=2;i<=NF;i++) {
        outfile = outfiles[i]
        outstr[outfile] = outstr[outfile] FS $i
    }
    for (outfile in outstr)
        if ( (NR > 1) || !exists[outfile] )
            print $1 outstr[outfile] >> outfile
}
' inFile.csv
0
votes

Just figure out the name associated with each column and use that mapping to manipulate the columns. If you're trying to do this in awk, you can use associative arrays to store the column names and the rows those correspond to. If you're using ksh93 or bash, you can use associative arrays to store the column names and the rows those correspond to. If you're using perl or python or ruby or ... you can...

Or push the columns into an array to map the numbers to column numbers.

Either way, then you have a list of column headers, which can further be manipulated however you need to.

0
votes

The solution I have found most useful to this kind of problem is to first retrieve the column number using an AWK script (encapsulated in a shell function) and then follow with a cut statement. This technique/strategy turns into a very concise, general and fast solution that can take advantage of co-processing. The non-append case is cleaner, but here is an example that handles the complication of the append you mentioned:

#! /bin/sh
fields() {
        LC_ALL=C awk -F, -v pattern="$1" '{
                j=0; split("", f)
                for (i=1; i<=NF; i++) if ($(i) ~ pattern) f[j++] = i
                if (j) {
                        printf("%s", f[0])
                        for (i=1; i<j; i++) printf(",%s", f[i])
                }
                exit 0
        }' "$2"
}
cut_fields_with_append() {
        if [ -s "$3" ]
        then
                cut -d, -f `fields "$1" "$2"` "$2" | sed '1 d' >> "$3"
        else
                cut -d, -f `fields "$1" "$2"` "$2" > "$3"
        fi
}
cut_fields_with_append '^[^-]+$|1-' values.csv value1.csv &
cut_fields_with_append '^[^-]+$|2-' values.csv value2.csv &
cut_fields_with_append '^[^-]+$|3-' values.csv value3.csv &
wait

The result is as you would expect:

$ ls
values  values.csv
$ cat values.csv 
xValue,value1-avg,value1-median,value2-avg,value3-avg,value3-median
1,3,4,20,14,20
$ ./values
$ ls
value1.csv  value2.csv  value3.csv values  values.csv
$ cat value1.csv
xValue,value1-avg,value1-median
1,3,4
$ cat value2.csv
xValue,value2-avg
1,20
$ cat value3.csv 
xValue,value3-avg,value3-median
1,14,20
$ ./values
$ cat value1.csv 
xValue,value1-avg,value1-median
1,3,4
1,3,4
$ cat value2.csv 
xValue,value2-avg
1,20
1,20
$ cat value3.csv 
xValue,value3-avg,value3-median
1,14,20
1,14,20
$