dplyr pipeline: complex functions in summarise

Question

I used the following code to generate data.frame 'df' from my original data, 'pseudo'.

df <- pseudo %>%
  group_by(Drug, CLSI_interpretation) %>%
  summarise(n = n()) %>%
  filter(Drug %in% c('Cefepime', 'Ceftazidime', 'Piperacillin', 'Piperacillin/tazobactam',
                     'Imipenem', 'Meropenem', 'Doripenem', 'Ciprofloxacin', 'Levofloxacin',
                     'Gentamicin', 'Tobramycin', 'Amikacin')) %>%
  mutate(freq = n / sum(n) * 100)

I then apply a very long mapvalues call that creates the 'class' column from 'Drug'.

All good so far; this generates a dataset that looks like the following:

Drug         CLSI   n       freq        class
Amikacin        I   7213    4.25503047  Aminoglycosides
Amikacin        R   13995   8.25580915  Aminoglycosides
Amikacin        S   148309  87.48916038 Aminoglycosides
Cefepime        I   13326   8.87713502  Cephalosporins
Cefepime        R   9744    6.49098031  Cephalosporins  
Cefepime        S   127046  84.63188468 Cephalosporins
Ceftazidime     I   10836   5.98558290  Cephalosporins
Ceftazidime     R   15276   8.43814732  Cephalosporins
Ceftazidime     S   154923  85.57626978 Cephalosporins
Ciprofloxacin   I   8949    4.74295103  Fluoroquinolones
Ciprofloxacin   R   31563   16.72832309 Fluoroquinolones

I'm struggling with the next steps. I need to group this data by 'class' and, for each class, total the 'n' where CLSI %in% c('I','R') and generate a new frequency: basically n(I+R)/n(I+R+S) and n(S)/n(I+R+S) for each class. I'm having a lot of trouble figuring out the summarise call because I need to summarise one variable (n) with reference to another (CLSI) while keeping the data grouped by a third (class). Thanks for your help.

In a chain like lhs %>% rhs, the %>% operator is "Placing lhs as the first argument in rhs call". Thus, in your case, the result of group_by(pseudo, class, CLSI) is used as the .data argument in summarize. When you add "pseudo", it is no longer the first argument, but one of the .... summarise is expecting a single value of each call in .... The 'result' of "pseudo" on the other hand is an entire data frame. Thus, the error. — Henrik
Are you looking for something like df %>% group_by(class) %>% mutate(prop.s = mean(CLSI == "S"), prop.ir = 1 - prop.s)? — aosmith
Thanks @aosmith but this just prints 1/3 and 2/3, as it's just measuring the proportion of the CLSI entries S vs I+R in each class, which is 1:2. I need it to sum the n's associated with these CLSI's. — jlev514
So, e.g., just mutate(propn.s = sum(n[CLSI == "S"])/sum(n)) for each group? — aosmith

Dieter Menne · Accepted Answer · 2015-04-21T07:19:13

It's always good to show the complete code, including the step that reads the data. It looks like pseudo is your data. Calls inside a %>% pipe differ slightly from ordinary R calls: the first argument is implicitly whatever is piped in from the left. Or, simply: remove "pseudo" from the calls inside the pipe.

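library(dplyr)
# Read the data; "a.csv" stands in for your own file
pseudo <- read.table("a.csv", header = TRUE)
# The piped data frame is the implicit first argument, so "pseudo" is not repeated
pseudo <- pseudo %>%
  group_by(class, CLSI) %>%
  summarise(n = n())  # n() counts the rows in each class/CLSI group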
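
Note that n() counts rows. For the class-level percentages asked about in the question, where df already carries a count column n, one possible sketch follows the sum(n[CLSI == "S"])/sum(n) idea from the comments. The data frame below is a small hand-typed subset of the question's sample rows (only the drugs with all three of I, R and S shown), and the output column names n_ir, n_s, total, freq_ir and freq_s are illustrative assumptions, not names from the original post.

```r
library(dplyr)

# Subset of the question's sample output, typed in by hand for illustration
df <- data.frame(
  Drug  = rep(c("Amikacin", "Cefepime", "Ceftazidime"), each = 3),
  CLSI  = rep(c("I", "R", "S"), times = 3),
  n     = c(7213, 13995, 148309,
            13326, 9744, 127046,
            10836, 15276, 154923),
  class = c(rep("Aminoglycosides", 3), rep("Cephalosporins", 6)),
  stringsAsFactors = FALSE
)

by_class <- df %>%
  group_by(class) %>%
  summarise(
    n_ir    = sum(n[CLSI %in% c("I", "R")]),  # total count of I and R within the class
    n_s     = sum(n[CLSI == "S"]),            # total count of S within the class
    total   = sum(n),                         # n(I + R + S) within the class
    freq_ir = 100 * n_ir / total,             # percent intermediate or resistant
    freq_s  = 100 * n_s / total               # percent susceptible
  )

print(by_class)
```

The logical subsetting n[CLSI %in% c("I", "R")] is what lets summarise total one column (n) conditional on another (CLSI) while the grouping stays on class. The summaries are deliberately not named n, so that sum(n) keeps referring to the original count column.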
