AWK post-processing of multi-column data

Question

I am working with a set of txt files, each containing multi-column information on one line. In my bash script I use the following AWK expression to take the filename of each txt file and the number from its 5th column, and save them in two-column format in a results.CSV file (the output is piped to SED, which removes the file path and extension from the final CSV):

awk '-F, *' '{if(FNR==2) printf("%s| %s \n", FILENAME,$5) }' ${tmp}/*.txt | sed 's|\/Users/gleb/Desktop/scripts/clusterizator/tmp/||; s|\.txt||'  >> ${home}/"${experiment}".csv

This produces (for 5 txt files) a CSV like this:

lig177_cl_5.2| -0.1400 
lig331_cl_3.5| -8.0000 
lig394_cl_1.9| -4.3600 
lig420_cl_3.8| -5.5200 
lig550_cl_2.0| -4.3200

How could I modify my AWK expression to exclude "_cl_x.x" from the name of each txt file, and also add the name of the CSV as a comment on the first line of the resulting CSV file:

# results.CSV
lig177| -0.1400 
lig331| -8.0000 
lig394| -4.3600 
lig420| -5.5200 
lig550| -4.3200

karakfa · Accepted Answer · 2021-02-09T15:04:59

Based on the rest of the pipe, I think you want to do something like this and get rid of the sed invocation.

awk -F', *' 'FNR==2 {f=FILENAME; 
                     sub(/.*\//,"",f);
                     sub(/_.*/ ,"",f);
                     printf("%s| %s\n", f, $5) }' "${tmp}"/*.txt >> "${home}/${experiment}.csv"

this will convert

/Users/gleb/Desktop/scripts/clusterizator/tmp/lig177_cl_5.2.txt

to

lig177

The pattern replacement is generic

/path/to/the/file/filename_otherstringshere...

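will extract only filename, i.e. the text between the last / char and the first _ char after it. This relies on the greedy matching of regex patterns: .*\/ consumes everything up to the last /, and _.* then removes everything from the first remaining _ to the end of the string. Note that a name with no _ would keep its .txt extension, since the second sub() would find nothing to replace.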
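
To see the two substitutions in isolation, here is a minimal sketch that applies them to a plain string rather than to FILENAME. The variables `path` and `name` are only illustrative; the sample path is the one from the example above.

```shell
# Illustrative sample path, copied from the example above.
path="/Users/gleb/Desktop/scripts/clusterizator/tmp/lig177_cl_5.2.txt"

# Apply the same two sub() calls to a string passed in with -v.
name=$(awk -v f="$path" 'BEGIN {
    sub(/.*\//, "", f)   # greedy: removes everything up to the last "/"
    sub(/_.*/,  "", f)   # removes from the first "_" to the end
    print f
}')

echo "$name"   # prints: lig177
```

Because the directory part is stripped first, an underscore somewhere in the path itself does not affect the result.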

For the output filename, it's easier to write it before the awk call, since it's only a single line.

$ echo "# ${experiment}.csv" > "${home}/${experiment}.csv"
$ awk ... >> "${home}/${experiment}.csv"
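
Putting the two steps together, a sketch of the whole sequence might look like the following. The temporary directories, the value of `experiment`, the `out` variable and the two input files are made up for the demonstration; it also assumes the data sits on the second line of each file (as FNR==2 implies) and writes the header with a "# " prefix so that the first line matches the comment line requested in the question.

```shell
# Hypothetical stand-ins for the variables used in the original script.
tmp=$(mktemp -d)
home=$(mktemp -d)
experiment="results"

# Two made-up input files: a header line, then one comma-separated data line.
printf 'a, b, c, d, score\nx, y, z, w, -0.1400\n' > "${tmp}/lig177_cl_5.2.txt"
printf 'a, b, c, d, score\nx, y, z, w, -8.0000\n' > "${tmp}/lig331_cl_3.5.txt"

out="${home}/${experiment}.csv"

# Write the comment line first; ">" truncates any earlier copy of the file.
echo "# ${experiment}.csv" > "$out"

# Append one "name| value" row per input file.
awk -F', *' 'FNR==2 { f=FILENAME
                      sub(/.*\//, "", f)
                      sub(/_.*/,  "", f)
                      printf("%s| %s\n", f, $5) }' "${tmp}"/*.txt >> "$out"

cat "$out"
```

Using ">" for the header and ">>" for the awk output means that re-running the script starts from a fresh file instead of appending duplicate rows.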
