1
votes

I have ID's with data per year, but the years are repeated throughout the columns:

test1<-data.frame(
  ID=c("P1","P1","P2","P2","P1","P2","P1","P1") , 
  YEAR1 =c(10,30,50,40,50,45,12,8), 
  YEAR2=c(40,20,30,10,50,30,60,10), 
  YEAR3=c(300,200,170,150,150,120,90,100), 
  YEAR2= c(100,10,20,30,50,60,40,80))

and what I need to sum the data per year and get something like this:

result <- data.frame(
  ID=c("P1","P2") , 
  YEAR1 =c(110,135), 
  YEAR2=c(460,180), 
  YEAR3=c(840,440))
result
#  ID YEAR1 YEAR2 YEAR3
#1 P1   110   460   840
#2 P2   135   180   440

I've tried it with aggregate:

result <- aggregate(test1, by=list(test1$ID), FUN = sum)

But my problem is that does not sum the years that are equal, maybe is because years now is an attribute?

3

3 Answers

0
votes

An option in base R would be to split the data in to list of data.frames based on the similarity in column names and do the aggregate within the list and Reduce by mergeing with the common 'ID' column

lst1 <- lapply(split.default(test1[-1], 
     sub("\\.\\d+$", "", names(test1)[-1])), function(x) 
     aggregate(.~ ID, cbind(test1[1], colN = unlist(x)), sum))
lst1 <-  Map(function(x, y, z) setNames(x, c(y, z)), lst1, "ID", names(lst1))

Reduce(function(...) merge(..., by = "ID"), lst1)
#    ID YEAR1 YEAR2 YEAR3
#1 P1   110   460   840
#2 P2   135   180   440

Also, if there is only a single column that is duplicated,, then just do

transform(aggregate(.~ ID, test1, sum), YEAR2 = YEAR2 + YEAR2.1)[-5]
#   ID YEAR1 YEAR2 YEAR3
#1 P1   110   460   840
#2 P2   135   180   440
0
votes

You could try something like this:

library(tidyverse)
test1 %>% 
  gather(key = "yr", value = "value", -ID) %>% 
  mutate(yr = str_remove(yr, "\\.\\d+")) %>% 
  group_by(ID, yr) %>% 
  summarize(value = sum(value))

  ID    yr    value
  <fct> <chr> <dbl>
1 P1    YEAR1   110
2 P1    YEAR2   460
3 P1    YEAR3   840
4 P2    YEAR1   135
5 P2    YEAR2   180
6 P2    YEAR3   440

The gather operation creates a modified data frame with three columns: ID, year, and value. Then you can modify the year column to remove the .1, .2 etc. from the duplicate column names and perform a summary operation to get the totals by ID and year.

0
votes

Once you get the first part of the column names (except for ID), you can subset the relevant columns, add the rows together, and then use tapply to sum the resulting value by ID.

nms = names(test1)[-1]
nms = split(nms, sapply(strsplit(nms, "\\."), "[", 1))
sapply(nms, function(nm){
    tapply(rowSums(test1[nm]), test1["ID"], sum)
})
#   YEAR1 YEAR2 YEAR3
#P1   110   460   840
#P2   135   180   440