1
votes

I am trying to summarize groups in a large data frame using a function which works when nrow length of a group = 3, but produces NA's for all groups in a given column when the function is applied to any groups that have an nrow length = 1 or 2.

Below is a snapshot of the dataset I am working with. This data frame is the output of the group_by function which I have used to group samples by zones (0's or 1's) within an ID class 'ID'. The resultant groups are in the 'grp' column. For this dataset I am only interested in groups which have a Zone value of 1 (i.e. Groups 2, 4 and 7). Groups with Zone values of 1 are either 1 or 3 rows long.

> df
    ID From To Count Mean_Value Zone
1  ID1    0 48    15 0.04927333    0
2  ID1    0 48    18 0.06273333    0
3  ID1   45 48     3 1.02333333    1
4  ID1   48 49     1 0.10300000    1
5  ID1   49 51     2 0.71500000    1
6  ID1   51 58     7 0.10285714    0
7  ID1   58 59     1 0.55000000    1
8  ID1   59 62     3 0.26333333    0
9  ID2    0 36     9 0.05211111    0
10 ID2   33 34     1 0.70000000    1
11 ID2   34 50     9 0.11666667    0
> grp_test <- group_by(df, grp = data.table::rleid(ID, Zone))
> grp_test
# Groups:   grp [8]
   ID     From    To Count Mean_Value  Zone   grp
   <fct> <int> <int> <int>      <dbl> <int> <int>
 1 ID1       0    48    15     0.0493     0     1
 2 ID1       0    48    18     0.0627     0     1
 3 ID1      45    48     3     1.02       1     2
 4 ID1      48    52     1     0.103      1     2
 5 ID1      49    51     2     0.715      1     2
 6 ID1      51    58     7     0.103      0     3
 7 ID1      58    59     1     0.55       1     4
 8 ID1      59    62     3     0.263      0     5
 9 ID2       0    36     9     0.0521     0     6
10 ID2      33    34     1     0.7        1     7
11 ID2      34    50     9     0.117      0     8

I am now trying to summarize the groups to obtain the mean of each group (grp) with a Zone value of 1 by taking into account the 'Count' and 'Mean_Value' for each row, when computing the mean for a group which has 3 rows. I cannot simply use the 'mean' function as the mean value in each row could have varying counts. In the example above the first row from grp 2 comprises 3 counts with a mean of 1.02, the second row from grp 67 has a count of 1 with a mean value of 0.103 and the third row has a count of 2 and mean value of 0.715. The mean value for this group should be 0.767167. Below is the code used to calculate the mean when a group has length nrow = 3. The problem is that this function only works on groups with nrow length 3, and provides NA values for the entire column when there are any groups which have nrow lengths of 1 or 2. How can I summarize groups with nrow length = 3 as well as groups with nrow lengths of 1 or 2?

df_summarise <- summarise(grp_test, ID = first(ID),
                 From = first(From), 
                 To =  last(To), 
                 Number_of_Intervals = n(),
                 Average = (((first(Count) * first(Mean_Value)) + (nth(Count, 2) * nth(Mean_Value, 2)) + (last(Count) * last(Mean_Value))) / (sum(Count))), 
                 Zone = first(Zone))

The expected outcome is below

   ID From To Count Mean_Value Zone grp
1 ID1   45 51     6  0.7671667    1   2
2 ID1   58 59     1  0.5500000    1   4
3 ID2   33 34     1  0.7000000    1   7
3
can you show the expected outputakrun

3 Answers

3
votes

You're looking for the weighted mean? I.e.:

df %>% group_by(data.table::rleid(ID, Zone), ID, Zone) %>%
  summarise(
    From = min(From),
    To = max(To),
    Number_of_Intervals = n(),
    Average = weighted.mean(Mean_Value, Count)
  )
1
votes

Yes, I think @MrGumble is right, you are looking for weighted.mean.

Another way to calculate that is :

library(dplyr)

grp_test %>%
  filter(Zone == 1) %>%
  summarise(From = min(From),
            To = max(To),
            Number_of_Intervals = n(),
            Average = sum(Mean_Value * Count)/sum(Count))

#   grp  From    To Number_of_Intervals Average
#  <int> <int> <int>               <int>   <dbl>
#1     2    45    51                   3   0.767
#2     4    58    59                   1   0.55 
#3     7    33    34                   1   0.7  
1
votes

We can also use data.table

library(data.table)
setDT(df)[, .(From = min(From),
               To = max(To),
                Number_of_Intervals = .N,
              Average = weighted.mean(Mean_Value, Count)), 
         .(grp = rleid(ID, Zone), ID, Zone)]