0
votes

I am having trouble formatting a large 2D matrix into a "sparse matrix" format. I hope someone here can help.

Due to the large file size, an awk script should be prefered.

Here is a hypothetical set from the data (tab delimited):

Regions      string8-0      string8-5000      string8-10000
string8-0        0                2.1                1.15    
string8-5000     0                0                  2.4  
string8-10000    0                4                  3.1

So, the resulted output should be a tab separeted file with four columns. Column1 = the numeric part of rownames (colomn1 from data; example: 0, 5000, 1000); Column2 = is the corresponding numeric part of the colnames (example: 0, 5000, 10000), Column3 = the value between the two bin. Column4 = string part of column names (string8, string8, string8).

And only print this if the corresponding value is not zero.

like this:

0      5000    2.1    string8
0      10000   1.15   string8
5000   10000   2.4    string8
10000  5000    4      string8
10000  10000   3.1    string8

Sorry, if it is a redundant question, but I could not find any solutions.

Thanks, Laszlo

1
You say you are having trouble --- what have you tried and what trouble in particular are you having?jas

1 Answers

0
votes

EDIT: Now outputs separate lines for multiple "hits" in input row

$ cat sp.awk
BEGIN { FS = OFS = "\t" }

NR == 1 {
    for (c=2; c<=NF; ++c) {
        split($c, col_parts, "-")
        col_str[c] = col_parts[1]
        col_num[c] = col_parts[2]
    }
    next
}

{
    split($1, parts, "-");
    col1 = parts[2]
    for (c=2; c<=NF; ++c) {
        if ($c != 0) {
            print col1, col_num[c], $c, col_str[c]
        }
    }
}

$ awk -f sp.awk sp.txt
0       5000    2.1     string8
0       10000   1.15    string8
5000    10000   2.4     string8
10000   5000    4       string8
10000   10000   3.1     string8