0
votes

I'm facing an issue in R which I have described below. I need the count and average of all columns (avoiding NA's) per group ID which satisfy a condition.

I have the dataset below and in it there is an additional column. "T"

    structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L), S1 = c(NA, 5L, 
1L, 2L, 4L, 2L), S2 = c(1L, 2L, 3L, 7L, NA, 11L), T = c(3L, 3L, 
3L, 5L, 2L, 2L)), class = "data.frame", row.names = c("1", "2", 
"3", "4", "5", "6"))

Is it possible to get the count and average of each column only when the column fulfills the condition (column value <= T value); (In the original dataset, there are more than just S1 and S2..)

The resulting dataframe should look like:

  ID S1.count S2.count S1.overall S2.overall S1.per S2.per
1 1  1        3        2          3          0.5    1.0  
2 2  1        0        1          1          1.0    0.0
3 3  1        0        2          1          0.5    0.0

S1.count & S2.count : Number of observations which are less than T respectively.

S1.overall & S2.overall: Total number of observations (avoiding NA's) in each column.

S1.per & S2.per: S1.count/S1.overall , S2.count/S2.overall.

2
what code do you have tried so far?Richard Telford

2 Answers

6
votes

You can use the aggregate() function to get information about the occurrence of specific values, given a group. Grouping your data to the group ID, you can construct the dataframe you just outlined quite easily:

data = data.frame("ID"=c(1,1,1,2,3,3),
                  "S1"=c(NA,5,1,2,4,2),
                  "S2"=c(1,2,3,7,NA,11),
                  "T"=c(3,3,3,5,2,2))

newdata = data.frame("ID"=unique(data$ID),
                "S1.count"=aggregate(c(data$S1<=data$T)~data$ID,FUN=sum)[,2],
                "S2.count"=aggregate(c(data$S2<=data$T)~data$ID,FUN=sum)[,2],
                "S1.overall"=aggregate(c(!is.na(data$S1))~data$ID,FUN=sum)[,2],
                "S2.overall"=aggregate(c(!is.na(data$S2))~data$ID,FUN=sum)[,2])

newdata$S1.per = newdata$S1.count/newdata$S1.overall
newdata$S2.per = newdata$S2.count/newdata$S2.overall

I worked with logical vectors, here, to count the number of valid and smaller than T values. The sum of a logical vector is the number of TRUE elements.

The output of this short program would be:

> newdata
  ID S1.count S2.count S1.overall S2.overall S1.per S2.per
1  1        1        3          2          3    0.5      1
2  2        1        0          1          1    1.0      0
3  3        1        0          2          1    0.5      0
2
votes

Using dplyr, you can calculate this as follows :

library(dplyr)

df %>%
  group_by(ID) %>%
  summarise(across(starts_with('S'), list(count = ~sum(. <= T, na.rm = TRUE), 
                                          overall = ~sum(!is.na(.)), 
                                          per = ~mean(. <= T, na.rm = TRUE)))) %>%
  select(ID, ends_with('count'), ends_with('overall'), everything())

#     ID S1_count S2_count S1_overall S2_overall S1_per S2_per
#  <int>    <int>    <int>      <int>      <int>  <dbl>  <dbl>
#1     1        1        3          2          3    0.5      1
#2     2        1        0          1          1    1        0
#3     3        1        0          2          1    0.5      0

data

df <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L), S1 = c(NA, 5L, 
1L, 2L, 4L, 2L), S2 = c(1L, 2L, 3L, 7L, NA, 11L), T = c(3L, 3L, 
3L, 5L, 2L, 2L)), class = "data.frame", row.names = c(NA, -6L))