1
votes

In some fields, there are multiple items separated by comma. I want to use AWK to split them into different rows, but within the original column not effecting the other columns.

The input looks like this:

Gene1  human1,human2  dog1  cat1,cat2
Gene2                 dog2  cat3
Gene3  human3               cat4,cat5

Expected output is:

Gene1  human1  dog1  cat1
Gene1  human2        cat2
Gene2          dog2  cat3
Gene3  human3        cat4
Gene3                cat5

(The separator of columns is tab)

3
What have you tried yourself? - oguz ismail
Are your fields REALLY fixed-width as shown in your question or is your input actually some other format (e.g. tab-separated or semi-colon separated or comma-separated with the fields quoted) and you're showing it in a tabular way to try to help us visualize your data? If your real data is not fixed width fields then edit your example to truly show your real data format as THAT is what we have to help you write code to parse. - Ed Morton
I see you updated your question to say the separator is tabs. Its FAR better if you choose a visible character as your separator when posting examples. If you showed, say, ; instead of tab as the separator in your examples then you could get a solution for ; and simply change the ; in the code to tab before running it on your real data. That would remove all questions about what the separator is is and where it occurs for us reading your example and trying to help you. - Ed Morton
Thank you a lot for the suggestion. I see the confusion. But if ; is used as separator, the table (especially for the input where lengths are various) becomes messy and hard to interpret. So I think the current way (put them in nice table but indicate the separator) is the best way. - XiaokangZH
@XiaokangZH for those of us trying to help you, having a separator we can see is vastly better than having one we can't see wrt understanding the example and writing code to parse the actual input. Putting the data in a nice table in addition to showing the real data can be helpful but it's no replacement for seeing the real format of the data when trying to write a tool to parse that format. I promise if you used semicolons in your example no-one would have a hard time understanding it - it'd just be much clearer, unambiguous, and something we could simply copy/paste to test against. - Ed Morton

3 Answers

2
votes
$ cat a.awk
#!/bin/awk -f
{
    max_n = 0
    # output 4 column data
    n = split($2, arr2 ,",");
    if( max_n < n ) max_n = n

    n = split($3, arr3 ,",");
    if( max_n < n ) max_n = n

    n = split($4, arr4 ,",");
    if( max_n < n ) max_n = n

    for( j = 1; j <= max_n ; j++)
        printf("%s\t%s\t%s\t%s\n", $1, arr2[j], arr3[j],arr4[j]);
}

$ awk -F'\t' -f a.awk a.txt

Gene1   human1  dog1    cat1
Gene1   human2          cat2
Gene2           dog2    cat3
Gene3   human3          cat4
Gene3                   cat5

note: field-separator should be the '\t'

1
votes

Alternative using perl:

$ perl -F"\t" -MList::MoreUtils=zip6 -lane '
  BEGIN { $, = "\t" }
  @c2 = split /,/, $F[1];
  @c3 = split /,/, $F[2];
  @c4 = split /,/, $F[3];
  print $F[0], @$_ for zip6 @c2, @c3, @c4;' input.tsv
Gene1   human1  dog1    cat1
Gene1   human2          cat2
Gene2           dog2    cat3
Gene3   human3          cat4
Gene3                   cat5

This uses the non-core List::MoreUtils module, installable through your OS package manager (Debian/Ubuntu package is liblist-moreutils-perl) or CPAN client of choice (cpanm List::MoreUtils etc.).

It also assumes that columns are separated by tabs.

1
votes

Assuming your input is fixed width as shown in your example then using GNU awk for FIELDWIDTHS:

$ cat tst.awk
BEGIN { FIELDWIDTHS="7 15 6 *"; OFS="\t" }
{
    delete vals
    numCols = NF
    for (colNr=1; colNr<=numCols; colNr++) {
        numRows = split($colNr,f,/,/)
        for (rowNr=1; rowNr<=numRows; rowNr++) {
            val = f[rowNr]
            gsub(/^[[:space:]]+|[[:space:]]+$/,"",val)
            vals[rowNr,colNr] = val
        }
    }

    for (rowNr=1; rowNr<=numRows; rowNr++) {
        printf "%s", vals[1,1]
        for (colNr=2; colNr<=numCols; colNr++) {
            printf "%s%s", OFS, vals[rowNr,colNr]
        }
        print ""
    }
}

$ awk -f tst.awk file
Gene1   human1  dog1    cat1
Gene1   human2          cat2
Gene2           dog2    cat3
Gene3   human3          cat4
Gene3   human2          cat5

Obviously the above will work for any number of fields in your input. If the input is tab-separated or anything else then replace FIELDWIDTHS=... with FS=whatever-your-separator-is. If you want the output to just look tabular regardless of what the fields contain rather than be tab-separated then pipe it to column -s$'\t' -t or use printf with a width instead of print.

Using semi-colons as the separator so you can see them (again just set FS and OFS to be whatever you actually use):

$ cat file
Gene1;human1,human2;dog1;cat1,cat2
Gene2;;dog2;cat3
Gene3;human3;;cat4,cat5

$ cat tst.awk
BEGIN { FS=OFS=";" }
{
    delete vals
    numCols = NF
    for (colNr=1; colNr<=numCols; colNr++) {
        numRows = split($colNr,f,/,/)
        for (rowNr=1; rowNr<=numRows; rowNr++) {
            val = f[rowNr]
            vals[rowNr,colNr] = val
        }
    }

    for (rowNr=1; rowNr<=numRows; rowNr++) {
        printf "%s", vals[1,1]
        for (colNr=2; colNr<=numCols; colNr++) {
            printf "%s%s", OFS, vals[rowNr,colNr]
        }
        print ""
    }
}

$ awk -f tst.awk file
Gene1;human1;dog1;cat1
Gene1;human2;;cat2
Gene2;;dog2;cat3
Gene3;human3;;cat4
Gene3;;;cat5