14
votes

I am trying to create one table that summarizes several categorical variables (using frequencies and proportions) by another variable. I would like to do this using the dplyr package.

These previous Stack Overflow discussions have partially what I am looking for: Relative frequencies / proportions with dplyr and Calculate relative frequency for a certain group.

Using the mtcars dataset, this is what the output would look like if I just wanted to look at the proportion of gear by am category:

    mtcars %>%
    group_by(am, gear) %>%
    summarise (n = n()) %>%
    mutate(freq = n / sum(n))

    #   am gear  n      freq
    # 1  0    3 15 0.7894737
    # 2  0    4  4 0.2105263
    # 3  1    4  8 0.6153846
    # 4  1    5  5 0.3846154

However, I actually want to look at not only the gears by am, but also carb by am and cyl by am, separately, in the same table. If I amend the code to:

    mtcars %>%
    group_by (am, gear, carb, cyl) %>%
    summarise (n = n()) %>%
    mutate(freq = n / sum(n))

I get the frequencies for each combination of am, gear, carb, and cyl. Which is not what I want. Is there any way to do this with dplyr?

EDIT

Also, it would be an added bonus if anyone knew of a way to produce the table I want, but with the categories of am as the columns (as in a classic 2x2 table format). Here is an example of what i'm referring to. It is from one of my previous publications. I want to produce this table in R, so that I can output it directly to a word document using RMarkdown:

enter image description here

3
Is there a reason it has to be done in dplyr? And is one of the 'groups' always the same? (Here it's am)Heroka
And can you give an example of the table you want? It's certainly possible with some reshaping, but I'm not sure what you're after.Heroka
There is no super important reason it has to be done in dplyr, except that i'm trying to learn the package very well, so that I have a consistent method available for producing my tables. Another driving reason for dplyr is that it produces a data frame as output, which allows me to use the stargazer package for producing publication worthy tables that I can then output to a word document using RMarkdown. I am, of course, open to alternative methods that others think are better for doing this.RNB
I am having a similar problem. How did you construct your table in the end @RNB?Frederick

3 Answers

7
votes

With tidyr/dplyr combination, here is how you would do it:

library(tidyr)
library(dplyr)

mtcars %>%
  gather(variable, value, gear, carb, cyl) %>%
  group_by(am, variable, value) %>%
  summarise (n = n()) %>%
  mutate(freq = n / sum(n))
11
votes

One way to solve this, is to turn your data to a long(er) format. You can then use the same code to calculate the outcomes you want, with one extra group_by:

library(reshape2)
library(dplyr)

m_mtcars <- melt(mtcars,measure.vars=c("gear","carb","cyl"))

res <- m_mtcars %>%
  group_by(am, variable, value) %>%
  summarise (n = n()) %>%
  mutate(freq = n / sum(n))

Building on this, the desired output can be obtained using more reshaping and some string formatting

#make an 'export' variable
res$export <- with(res, sprintf("%i (%.1f%%)", n, freq*100))

#reshape again
output <- dcast(variable+value~am, value.var="export", data=res, fill="missing") #use drop=F to prevent silent missings 
#'silent missings'
output$variable <- as.character(output$variable)
#make 'empty lines' 
empties <- data.frame(variable=unique(output$variable), stringsAsFactors=F)
empties[,colnames(output)[-1]] <- ""

#bind them together
output2 <- rbind(empties,output)
output2 <- output2[order(output2$variable,output2$value),]

#optional: 'remove' variable if value present

output2$variable[output2$value!=""] <- ""

This results in:

   variable value          0         1
2      carb                           
7               1  3 (15.8%) 4 (30.8%)
8               2  6 (31.6%) 4 (30.8%)
9               3  3 (15.8%)   missing
10              4  7 (36.8%) 3 (23.1%)
11              6    missing  1 (7.7%)
12              8    missing  1 (7.7%)
3       cyl                           
13              4  3 (15.8%) 8 (61.5%)
14              6  4 (21.1%) 3 (23.1%)
15              8 12 (63.2%) 2 (15.4%)
1      gear                           
4               3 15 (78.9%)   missing
5               4  4 (21.1%) 8 (61.5%)
6               5    missing 5 (38.5%)
-1
votes

An alternative to group by and then summarize is to use a count().

This just makes the code 1 line more concise

library(reshape2)
library(dplyr)

m_mtcars <- melt(mtcars,measure.vars=c("gear","carb","cyl"))

res <- m_mtcars %>%
  count(am, variable, value) %>%
  mutate(freq = n / sum(n))

The other benefit is that this will save the other values that are lost in a group_by summarize. The resulting table looks like this

enter image description here