Tabulate responses for multiple columns by grouping variable with dplyr

Question

Hi:I'm new to the plyr/dplyr family but enjoying it. I can see it's massive utility for my own work, but I'm stil trying to get my head around it.
I have a data frame that looks like below.

1) How do I produce a table for each non-grouping variable that shows the distribution of responses within each value of the grouping variable?

2) Note: I do have some missing values and I would like to exclude them from the tabulation. I realize the summarize_each command will apply the function to each column, but I don't know how to handle the missing values issue in a simple way. I have seen some codes that suggest you have to filter out missing values, but what if the missing values are scattered randomly through the non-grouping variables?

3) Fundamentally, is it best to just use complete cases with dplyr?

#library
library(dplyr)
#sample data
group<-sample(c('A', 'B', 'C'), 100, replace=TRUE)
var1<-sample(c(1,2,3,4,5,NA), 100, replace=TRUE,     prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
var2<-sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
var3<-sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
df<-data.frame(group, var1, var2, var3)
#my code
out_df<-df %>%group_by(group)
out_df %>% summarise_each(funs(table))

eipi10 eipi10 · Accepted Answer · 2015-04-14T17:11:07

You can get counts by group for each of var1, var2, and var3 if you "melt" your data frame into long form first, which will "stack" the three var columns into a single column (value) and then create an additional column (variable) marking which rows go with which var.

library(dplyr)
library(reshape2)

#sample data
group <- sample(c('A', 'B', 'C'), 100, replace=TRUE)
var1 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
var2 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
var3 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))

df<-data.frame(group, var1, var2, var3)

out_df <- df %>% 
  melt(id.var="group") %>%
  filter(!is.na(value)) %>%  # Remove NA
  group_by(group, variable, value) %>%
  summarise(count=n()) %>% 
  group_by(group, variable) %>% 
  mutate(percent=count/sum(count))

You can stop the function chain at any point to look at the intermediate steps, which will help in understanding what each step is doing.

Because we grouped by group, variable, and value, we end up with count giving us the number of rows for combination of those three columns. Then we group only by group and variable to calculate the percentage of rows that each value of count contributes to each combination of the two grouping variables. (The second group_by is not essential, because dplyr drops the last grouping variable after a summarise operation (because there will only be one row for each combination of all the original grouping variables) but I prefer to regroup explicitly.)

Here's the final result:

out_df

   group variable value count    percent
1      A     var1     1     6 0.26086957
2      A     var1     2     3 0.13043478
3      A     var1     3     6 0.26086957
4      A     var1     4     1 0.04347826
5      A     var1     5     7 0.30434783
...
41     C     var3     1     6 0.25000000
42     C     var3     2     5 0.20833333
43     C     var3     3     4 0.16666667
44     C     var3     4     2 0.08333333
45     C     var3     5     7 0.29166667

Tabulate responses for multiple columns by grouping variable with dplyr

2 Answers