
I have a data.frame that looks like this:

    # A tibble: 2,003 x 16
   barcost barrulesplay barrulessch barrulesrelax barrulesinjury barriskskills barraincold barrainsick barrainmessy barraininjury barrainparentdis… barrainchilddis… barrainchildclo…
     <int>        <int>       <int>         <int>          <int>         <int>       <int>       <int>        <int>         <int>             <int>            <int>            <int>
 1       3            4           3             4              4             4          NA          NA           NA            NA                NA               NA               NA
 2       2            5           5             5              3             5          NA          NA           NA            NA                NA               NA               NA
 3       2            2           2             3              2             4          NA          NA           NA            NA                NA               NA               NA
 4       2            4           4             4              2             4          NA          NA           NA            NA                NA               NA               NA
 5       2            3           3             4              2             4          NA          NA           NA            NA                NA               NA               NA
 6       2            4           4             4              3             4          NA          NA           NA            NA                NA               NA               NA
 7       3            5           5             4              2             4          NA          NA           NA            NA                NA               NA               NA
 8       4            5           5             4              4             3          NA          NA           NA            NA                NA               NA               NA
 9       1            5           5             5              3             5          NA          NA           NA            NA                NA               NA               NA
10       2            4           4             4              3             4          NA          NA           NA            NA                NA               NA               NA

When I use the "describe" function from Hmisc as follows, I get a list of lists (as expected):

describe(questions)


In the output, the data I want to extract and plot is in "frequency" under "values" of this list of lists.

How would I create a tidy data.frame that, for every column, holds the frequencies of 1s, 2s, 3s, etc. found in the list output from the "describe" function above? For example:

summary[["barcost"]][["values"]]

$value
[1] 1 2 3 4 5

$frequency
[1] 348 806 410 360  79

So I want a data.frame that has the original column headers as values of a variable (in a column named "questions", for example), together with the frequencies for each one: using the "barcost" question above, that would be 348 1s, 806 2s, etc., all listed against the "barcost" question variable.

I am aware that I may be trying to do something very complex when there is a simpler way of achieving the same goal, so I am open to suggestions.

@rawr, yes, this will do it for a specific "question", but I would like to be able to do it for all of them and get a table in tidy format (for graphing). For example, "barcost" would be listed 2003 times: 348 times with 1, 806 times with 2, etc., and likewise for all the other questions. - reubenmcg

1 Answer


You can get frequencies by column more directly, without going through describe. gather (from tidyr) converts the data to "long" format, with one row per column/value pair, which makes tabulation by group straightforward.

library(tidyverse)

freq = gather(questions) %>% group_by(key, value) %>% tally

Then you can graph the results, for example, like this:

ggplot(freq, aes(value, n)) +
  geom_col() +
  facet_wrap(~ key)
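
In tidyr 1.0.0 and later, gather is superseded by pivot_longer, so the tabulation step above can also be sketched as follows. This is only an illustration: questions_demo is a small made-up stand-in for questions, and the column name "question" is my choice, not something from the original data. Dropping the NA rows is also an assumption about what you want to plot.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for `questions` (values are for illustration only)
questions_demo <- tibble::tibble(
  barcost      = c(3L, 2L, 2L, 1L),
  barrulesplay = c(4L, 5L, 2L, 4L),
  barraincold  = c(NA, NA, 1L, 1L)
)

freq_demo <- questions_demo %>%
  # One row per (question, value) pair
  pivot_longer(everything(), names_to = "question", values_to = "value") %>%
  # Drop missing answers (assumption: NAs should not be counted)
  filter(!is.na(value)) %>%
  # Count how often each value occurs within each question
  count(question, value)

print(freq_demo)
```

The result has one row per question/value combination with its count in n, which is the same shape that the ggplot call above expects (with question in place of key).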

If we start with the output of describe, you could do this:

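# describe comes from Hmisc; calling it as Hmisc::describe avoids attaching the whole package
freq = map_df(Hmisc::describe(questions), ~.x$values, .id="Column")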

However, describe doesn't return frequencies for columns with fewer than three unique values, so this approach would exclude any such columns from the resulting freq data frame.
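
To make the extraction step concrete without depending on Hmisc, here is a minimal sketch that works on a hand-built list with the same shape as the output shown in the question (a values element holding value and frequency). The names desc_demo and twolevel are hypothetical, and representing a column without a frequency table as values = NULL is an assumption on my part, not something checked against Hmisc.

```r
library(dplyr)
library(purrr)

# Hand-built stand-in mimicking the structure shown in the question
desc_demo <- list(
  barcost  = list(values = list(value = 1:5,
                                frequency = c(348, 806, 410, 360, 79))),
  twolevel = list(values = NULL)  # hypothetical column with no frequency table
)

freq_from_desc <- desc_demo %>%
  # Keep only the entries that actually carry a frequency table
  keep(~ !is.null(.x$values)) %>%
  # Turn each value/frequency pair into a small tibble
  map(~ tibble::as_tibble(.x$values)) %>%
  # Stack them, recording the source column name in "Column"
  bind_rows(.id = "Column")

print(freq_from_desc)
```

Filtering explicitly with keep makes it visible which columns are dropped, rather than having them disappear silently when the pieces are bound together.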

UPDATE: If I understand your comment, here's a way to color based on proportions of values:

# Fake data
set.seed(2)
dat = replicate(10, sample(1:5, 50, replace=TRUE))

# Get frequencies and proportions
freq = dat %>% as.data.frame %>% 
  gather() %>% 
  group_by(key, value) %>% 
  tally %>% 
  mutate(pct=n/sum(n))

ggplot(freq, aes(value, n, fill=pct)) +
  geom_col() +
  facet_wrap(~ key, ncol=5) +
  scale_fill_gradient(low="red", high="blue")
