Using summarize_all with functions that both require and don't require na.rm=T argument

Question

Observations in my data are contained in groups, and I'm trying to get multiple summary statistics (e.g., mean, median, length, standard deviation) for each group using the summarize_all function.

The problem is that some functions (e.g., mean, median) require the na.rm=T argument, while others do not (e.g., n()). When I specify na.rm=T in summarize_all, it applies the na.rm argument to each function listed (below, mean and sd).

library(dplyr)

airquality %>% 
  select(Month, Ozone, Solar.R, Temp) %>%
  group_by(Month) %>%
  summarize_all(list(mean, sd), na.rm=T)

But when I also include n(), the na.rm argument is applied to it as well, which gives me the error: "Error: Evaluation error: unused arguments (Ozone, na.rm = TRUE)"

airquality %>% 
  select(Month, Ozone, Solar.R, Temp) %>%
  group_by(Month) %>%
  summarize_all(list(mean, sd, n), na.rm=T)

I'd also love to know how to get rid of the terrible column names that summarize_all creates when more than one function is used. For example, in the first chunk of code I get column names like mpg_<S4: standardGeneric> and cyl_<S4: standardGeneric>.

Are you sure the issue is the na.rm? If you take out the na.rm, you still get that Ozone error. I think the issue is that mean and sd take input arguments, whereas n() doesn't take any. - Jacqueline Nolis

Jacqueline Nolis · Accepted Answer · 2019-08-06T22:06:50

As I mentioned in the comments on your question, I think n() is causing a separate issue: it expects zero arguments, so I don't think you can use it in summarize_all. To isolate the na.rm question, let's assume you wanted length instead:

airquality %>% 
  select(Month, Ozone, Solar.R, Temp) %>%
  group_by(Month) %>%
  summarize_all(list(mean,sd,length),na.rm=T)

Error in .Primitive("length")(Ozone, na.rm = TRUE) : 2 arguments passed to 'length' which requires 1
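
The same failure can be reproduced without dplyr, which suggests that the extra na.rm argument is simply handed to every function in the list. This is a rough sketch of the mechanism rather than a trace of dplyr's internals, and the vector x is made up for illustration:

```r
# Made-up vector for illustration; not part of the airquality data.
x <- c(1, NA, 3)

# mean() and sd() accept na.rm, so the extra argument is harmless.
m <- mean(x, na.rm = TRUE)
s <- sd(x, na.rm = TRUE)

# length() takes exactly one argument, so the same call fails.
# tryCatch() captures the error message instead of stopping the script.
msg <- tryCatch(
  length(x, na.rm = TRUE),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Any function in the list that does not accept na.rm will fail the same way, which is why the argument has to be attached to each function individually.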

One solution is to wrap each function that needs na.rm=T in an anonymous function that sets the argument explicitly, and to pass the remaining functions unchanged:

airquality %>% 
  select(Month, Ozone, Solar.R, Temp) %>%
  group_by(Month) %>%
  summarize_all(list(
    mean = function(x) mean(x, na.rm = TRUE),
    sd = function(x) sd(x, na.rm = TRUE),
    length = length
  ))

Also notice that the name of the item in the list changes how it shows up in the data frame when you're done. So let's say we want the first one to be called "cool":

airquality %>% 
  select(Month, Ozone, Solar.R, Temp) %>%
  group_by(Month) %>%
  summarize_all(list(
    cool = function(x) mean(x, na.rm = TRUE),
    sd = function(x) sd(x, na.rm = TRUE),
    length = length
  ))

# A tibble: 5 x 10
  Month Ozone_cool Solar.R_cool Temp_cool Ozone_sd Solar.R_sd Temp_sd Ozone_length Solar.R_length Temp_length
  <int>      <dbl>        <dbl>     <dbl>    <dbl>      <dbl>   <dbl>        <int>          <int>       <int>
1     5       23.6         181.      65.5     22.2      115.     6.85           31             31          31
2     6       29.4         190.      79.1     18.2       92.9    6.60           30             30          30
3     7       59.1         216.      83.9     31.6       80.6    4.32           31             31          31
4     8       60.0         172.      84.0     39.7       76.8    6.59           31             31          31
5     9       31.4         167.      76.9     24.1       79.1    8.36           30             30          30
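
For readers on dplyr 1.0.0 or later, where scoped verbs such as summarize_all are superseded by across(), the same idea might be written as below. This is a sketch and not part of the original answer. The output names (mean, sd, n_obs, n) and the .names pattern are my own choices, and n_obs is a hypothetical extra that counts non-missing values, since length counts rows containing NA as well. Because n() is called with no arguments, it can sit in summarise() as its own summary next to across().

```r
library(dplyr)

res <- airquality %>%
  group_by(Month) %>%
  summarise(
    across(
      c(Ozone, Solar.R, Temp),
      list(
        # Each lambda sets na.rm itself, so nothing is forwarded blindly.
        mean  = ~ mean(.x, na.rm = TRUE),
        sd    = ~ sd(.x, na.rm = TRUE),
        # Hypothetical helper: number of non-missing values per column.
        n_obs = ~ sum(!is.na(.x))
      ),
      # Output columns are named <column>_<function name>.
      .names = "{.col}_{.fn}"
    ),
    # Group size, computed once per group rather than once per column.
    n = n()
  )

print(res)
```

Here the columns come out grouped by variable (Ozone_mean, Ozone_sd, Ozone_n_obs, then Solar.R_mean, and so on), whereas summarize_all groups them by function.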
