Summing values in R based on column value with dplyr

Question

I have a data set that has the following information:

Subject    Value1    Value2    Value3      UniqueNumber
001        1         0         1           3
002        0         1         1           2
003        1         1         1           1

If the value of UniqueNumber > 0, I would like to sum the values with dplyr for each subject from rows 1 through UniqueNumber and calculate the mean. So for Subject 001, sum = 2 and mean = .67.

total = 0;
average = 0;
for(i in 1:length(Data$Subject)){
   for(j in 1:ncols(Data)){
   if(Data$UniqueNumber[i] > 0){
    total[i] = sum(Data[i,1:j])
    average[i] = mean(Data[i,1:j])
   }
}

Edit: I am only looking to sum through the number of columns listed in the 'UniqueNumber' column. So this is looping through every row and stopping at column listed in 'UniqueNumber'. Example: Row 2 with Subject 002 should sum up the values in columns 'Value1' and 'Value2', while Row 3 with Subject 003 should only sum the value in column 'Value1'.

You can try df %>% mutate(sum = ifelse(UniqueNumber > 0, rowSums(.[, 2:(length(.)-1)]), NA), mean = ifelse(UniqueNumber > 0, rowMeans(.[, 2:(length(.)-1)]), NA)). — tmfmnk
@tmfmnk I don't think your code will iterate through on the length of the UniqueNumber. It looks like my results are summing through the whole column and not stopping at the value of the column UniqueValue. — statsguyz

David Arenburg David Arenburg · Accepted Answer · 2019-02-11T14:04:54

Not a tidyverse fan/expert, but I would try this using long format. Then, just filter by row index per group and then run any functions you want on a single column (much easier this way).

library(tidyr)
library(dplyr)

Data %>% 
  gather(variable, value, -Subject, -UniqueNumber) %>% # long format
  group_by(Subject) %>% # group by Subject in order to get row counts
  filter(row_number() <= UniqueNumber) %>% # filter by row index
  summarise(Mean = mean(value), Total = sum(value)) %>% # do the calculations
  ungroup() 

## A tibble: 3 x 3
#  Subject  Mean Total
#     <int> <dbl> <int>
# 1       1 0.667     2
# 2       2 0.5       1
# 3       3 1         1

A very similar way to achieve this could be filtering by the integers in the column names. The filter step comes before the group_by so it could potentially increase performance (or not?) but it is less robust as I'm assuming that the cols of interest are called "Value#"

Data %>% 
  gather(variable, value, -Subject, -UniqueNumber) %>% #long format
  filter(as.numeric(gsub("Value", "", variable, fixed = TRUE)) <= UniqueNumber) %>% #filter
  group_by(Subject) %>% # group by Subject
  summarise(Mean = mean(value), Total = sum(value)) %>% # do the calculations
  ungroup()

## A tibble: 3 x 3
#  Subject  Mean Total
#     <int> <dbl> <int>
# 1       1 0.667     2
# 2       2 0.5       1
# 3       3 1         1

Just for fun, adding a data.table solution

library(data.table)

data.table(Data) %>% 
  melt(id = c("Subject", "UniqueNumber")) %>%
  .[as.numeric(gsub("Value", "", variable, fixed = TRUE)) <= UniqueNumber,
    .(Mean = round(mean(value), 3), Total = sum(value)),
    by = Subject]

#    Subject  Mean Total
# 1:       1 0.667     2
# 2:       2 0.500     1
# 3:       3 1.000     1

Summing values in R based on column value with dplyr

6 Answers