I'm trying to summarize multiple columns using summarize_at() with a custom function. The part I'm stuck on is that the function, ssmd(), is meant to take two inputs: a vector of values from the current group established by group_by(), and a second vector of values from outside that group.

In the example below, x should be a vector for each set of values by Month (varies according to the current group), and y should be a fixed set of values for Month == 5.

# custom function
ssmd <- function(x, y){
  (mean(x, na.rm = TRUE) - mean(y, na.rm = TRUE)) / sqrt(var(x, na.rm = TRUE) + var(y, na.rm = TRUE))
}

# dataset
d <- airquality

# this isn't working - trying to find the difference between the mean for each Month and the mean of Month 5, for columns Ozone, Solar.R, Wind, and Temp
d %>%
  group_by(Month) %>%
  summarize_at(vars(Ozone:Temp), funs(ssmd, x = ., y = .[Month == 5])) %>%
  ungroup()

At the moment, this gives the following error: Error in mean(y, na.rm = TRUE) : argument "y" is missing, with no default. So I think I have a syntax error, in addition to being stuck on how to access values from outside the current group.

The expected output is a data frame with one row for each Month and one column for each variable (Ozone, Solar.R, Wind, and Temp).

2 Answers

1
votes

There are two issues:

1) When you refer to Month inside funs(), it only contains the values for the current group, not the entire data frame.

2) Issue 1 can be worked around by using .$Month, but inside summarize_at() you still don't have access to the entire column, so you cannot subset only the values where Month == 5.
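
To illustrate the first issue, here is a minimal sketch (the column names n_rows and n_may are made up for this illustration) that counts, within each group, how many rows satisfy Month == 5. Because Month only holds the current group's values inside a grouped summary, the count should be non-zero only for the Month 5 group, which is why .[Month == 5] is empty everywhere else.

```r
library(dplyr)

# Inside a grouped summarize(), Month refers only to the current group's rows
counts <- airquality %>%
  group_by(Month) %>%
  summarize(
    n_rows = n(),              # rows in the current group
    n_may  = sum(Month == 5)   # rows in the current group where Month is 5
  )

counts
```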

However, you don't need the custom function for the difference in means: take the mean of every column for each Month, then subtract from each column its value in the row where Month == 5.

library(dplyr)

d %>%
  group_by(Month) %>%
  summarize_at(vars(Ozone:Temp), mean, na.rm = TRUE) %>%
  mutate_at(vars(Ozone:Temp), ~ . - .[Month == 5])

# A tibble: 5 x 5
#  Month Ozone Solar.R  Wind  Temp
#  <int> <dbl>   <dbl> <dbl> <dbl>
#1     5  0       0     0      0  
#2     6  5.83    8.87 -1.36  13.6
#3     7 35.5    35.2  -2.68  18.4
#4     8 36.3    -9.44 -2.83  18.4
#5     9  7.83  -13.9  -1.44  11.4

To use the ssmd() function from the updated post, we can loop over the columns and take the Month 5 values from the full data frame:

library(dplyr)
library(purrr)

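# names of the columns to summarise
named_info <- d %>% select(Ozone:Temp) %>% names()

# for each column, compare every Month group with the Month 5 values
# taken from the full (ungrouped) data frame, then join the results by Month
map(named_info, function(col) {
  d %>%
    group_by(Month) %>%
    summarise_at(col, ~ ssmd(., d[[col]][d$Month == 5]))
}) %>%
  reduce(inner_join, by = 'Month')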
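
As a possible alternative, the same idea can be sketched in a single pipeline, assuming dplyr 1.0.0 or later, where across() and cur_column() are available. This is only a sketch, not necessarily the approach intended above: cur_column() supplies the name of the column being summarised, so the Month 5 reference values can be looked up in the ungrouped data frame d. The name res is made up for this example.

```r
library(dplyr)

# custom function from the question
ssmd <- function(x, y) {
  (mean(x, na.rm = TRUE) - mean(y, na.rm = TRUE)) /
    sqrt(var(x, na.rm = TRUE) + var(y, na.rm = TRUE))
}

d <- airquality

# .x is the current group's values for the current column;
# d[[cur_column()]][d$Month == 5] is the same column restricted to Month 5,
# read from the full data frame rather than from the current group
res <- d %>%
  group_by(Month) %>%
  summarise(across(Ozone:Temp, ~ ssmd(.x, d[[cur_column()]][d$Month == 5])))

res
```

The Month 5 row should be all zeros, since that group is compared with itself.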
1
votes

I don't know how to fix your syntax error, but here is a workaround. It summarizes the data as the monthly mean of each column and then subtracts the first value of each column, which is the mean for May (Month 5).

library(dplyr)

d <- airquality

d1 <- d %>%
  group_by(Month) %>%
  summarize_at(vars(Ozone:Temp), list(~mean(., na.rm = TRUE))) %>%
  ungroup()

d1[-1] <- lapply(d1[-1], function(x) x - x[1])

d1
# # A tibble: 5 x 5
#   Month Ozone Solar.R  Wind  Temp
#   <int> <dbl>   <dbl> <dbl> <dbl>
# 1     5  0       0     0      0  
# 2     6  5.83    8.87 -1.36  13.6
# 3     7 35.5    35.2  -2.68  18.4
# 4     8 36.3    -9.44 -2.83  18.4
# 5     9  7.83  -13.9  -1.44  11.4
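
Subtracting x[1] relies on May being the first row of the summary. A small variation, sketched here under the assumption that the reference month is identified by Month == 5 (the helper name is_ref is made up), looks the reference row up by value instead of by position:

```r
library(dplyr)

# monthly means, as above
d1 <- airquality %>%
  group_by(Month) %>%
  summarize_at(vars(Ozone:Temp), mean, na.rm = TRUE)

# logical index of the reference row (Month 5), wherever it sits
is_ref <- d1$Month == 5

# subtract the reference month's mean from every column except Month
d1[-1] <- lapply(d1[-1], function(x) x - x[is_ref])

d1
```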