awk: split a field into multiple rows within one column

Question

In some fields, there are multiple items separated by comma. I want to use AWK to split them into different rows, but within the original column not effecting the other columns.

The input looks like this:

Gene1  human1,human2  dog1  cat1,cat2
Gene2                 dog2  cat3
Gene3  human3               cat4,cat5

Expected output is:

Gene1  human1  dog1  cat1
Gene1  human2        cat2
Gene2          dog2  cat3
Gene3  human3        cat4
Gene3                cat5

(The separator of columns is tab)

Are your fields REALLY fixed-width as shown in your question or is your input actually some other format (e.g. tab-separated or semi-colon separated or comma-separated with the fields quoted) and you're showing it in a tabular way to try to help us visualize your data? If your real data is not fixed width fields then edit your example to truly show your real data format as THAT is what we have to help you write code to parse. — Ed Morton
I see you updated your question to say the separator is tabs. Its FAR better if you choose a visible character as your separator when posting examples. If you showed, say, ; instead of tab as the separator in your examples then you could get a solution for ; and simply change the ; in the code to tab before running it on your real data. That would remove all questions about what the separator is is and where it occurs for us reading your example and trying to help you. — Ed Morton
Thank you a lot for the suggestion. I see the confusion. But if ; is used as separator, the table (especially for the input where lengths are various) becomes messy and hard to interpret. So I think the current way (put them in nice table but indicate the separator) is the best way. — XiaokangZH
@XiaokangZH for those of us trying to help you, having a separator we can see is vastly better than having one we can't see wrt understanding the example and writing code to parse the actual input. Putting the data in a nice table in addition to showing the real data can be helpful but it's no replacement for seeing the real format of the data when trying to write a tool to parse that format. I promise if you used semicolons in your example no-one would have a hard time understanding it - it'd just be much clearer, unambiguous, and something we could simply copy/paste to test against. — Ed Morton

gud auk gud auk · Accepted Answer · 2019-05-05T10:54:12

$ cat a.awk
#!/bin/awk -f
{
    max_n = 0
    # output 4 column data
    n = split($2, arr2 ,",");
    if( max_n < n ) max_n = n

    n = split($3, arr3 ,",");
    if( max_n < n ) max_n = n

    n = split($4, arr4 ,",");
    if( max_n < n ) max_n = n

    for( j = 1; j <= max_n ; j++)
        printf("%s\t%s\t%s\t%s\n", $1, arr2[j], arr3[j],arr4[j]);
}

$ awk -F'\t' -f a.awk a.txt

Gene1   human1  dog1    cat1
Gene1   human2          cat2
Gene2           dog2    cat3
Gene3   human3          cat4
Gene3                   cat5

note: field-separator should be the '\t'

awk: split a field into multiple rows within one column

3 Answers