11
votes

I have a data frame where several columns may have the same name. In this small example, both column "A" and "G" occur twice:

    A  C  G  A  G  T
1   1 NA NA NA  1 NA
2   1 NA  5  3  1 NA
3  NA  1 NA NA NA  1
4  NA NA  1  2 NA NA
5  NA NA  1  1 NA NA
6  NA  1 NA NA NA  1
7  NA  1 NA NA NA  1

I wish to create a data set with one column per column name. For each row, the individual column values should be replaced with the sum (sum(..., na.rm = TRUE)) of the values within each column name. For example, in row two, the two individual "A" values (1 and 3) should be replaced with 4. I don't know in advance which column names that occur several times.

The expected output would then be:

#     A  C  G  T
# 1   1  0  1  0
# 2   4  0  6  0
# 3   0  1  0  1
# 4   2  0  1  0
# 5   1  0  1  0
# 6   0  1  0  1
# 7   0  1  0  1

So I guess I could do something like:

noms = colnames(dat)
for(x in noms[duplicated(noms)]) {
  dat[ , x] = rowSums(dat[ , x == noms], na.rm = TRUE)
}
dat = dat[,!duplicated(noms)]

But this is a bit clunky and for loops are meant to be evil. Is there any way to do this more simply?

2
You want to sum by row columns with the same names?David Arenburg
@DavidArenburg Yes, that's correct.user1356855

2 Answers

17
votes

We can transpose dat , calculate rowsum per group (colnames of the original dat), then transpose the result back to original structure.

t(rowsum(t(dat), group = colnames(dat), na.rm = T))
#  A C G T
#1 1 0 1 0
#2 4 0 6 0
#3 0 1 0 1
#4 2 0 1 0
#5 1 0 1 0
#6 0 1 0 1
#7 0 1 0 1
1
votes

We could split the dataframe by name using split.default and take the row-wise sum using rowSums to create one column for each unique name.

sapply(split.default(df, names(df)), rowSums, na.rm = TRUE)

#  A C G T
#1 1 0 1 0
#2 4 0 6 0
#3 0 1 0 1
#4 2 0 1 0
#5 1 0 1 0
#6 0 1 0 1
#7 0 1 0 1