3
votes

I am trying to convert my data.frame from a wide table to a long table with a frequency column.

data("UCBAdmissions")
ucb_admit <- as.data.frame(UCBAdmissions)
ucb_admit
   Admit    Gender Dept Freq
1  Admitted   Male    A  512
2  Rejected   Male    A  313
3  Admitted Female    A   89
4  Rejected Female    A   19
...  

And I want to gather this data (tidyr package, similar to melt from reshape) but use the Freq to specify how many times the row should be repeated.

My target data would therefore look something like:

     Admit      Gender Dept
1    Admitted   Male    A
2    Admitted   Male    A
3    Admitted   Male    A
4    Admitted   Male    A
5    Admitted   Male    A
6    Admitted   Male    A
...
4523 Rejected Female    F
4524 Rejected Female    F
4525 Rejected Female    F
4526 Rejected Female    F

I would like to use the tidyr::gather() to do this, however my results are not correct as I am unsure if/how to include the Freq column?

Thanks

2

2 Answers

2
votes

Here is a solution using dplyr, tidyr, and purrr.

library(dplyr)
library(tidyr)
library(purrr)

ucb_admit2 <- ucb_admit %>%
  mutate(Freq = map2(1, Freq, `:`)) %>%
  unnest() %>%
  select(-Freq)

Or use this similar approach, which only needs functions from dplyr and tidyr.

ucb_admit2 <- ucb_admit %>%
  rowwise() %>%
  mutate(Freq = list(seq(1, Freq))) %>%
  ungroup() %>%
  unnest() %>%
  select(-Freq)

Both of them adopt the same strategy: create a list column and then unnest it.

We can also consider use the separate_row function from tidyr to achieve this task.

ucb_admit2 <- ucb_admit %>%
  rowwise() %>%
  mutate(Freq = paste(seq(1, Freq), collapse = ",")) %>%
  ungroup() %>%
  separate_rows(Freq) %>%
  select(-Freq)

Benchmarking

I compared the two methods proposed by eipi10 and the three methods proposed by me, using the following microbenchmarking. The result shows that base R approach is the fastest, followed by the dplyr repeat and slice approach. So, I think unless there are other considerations, such as code readability, no need to use tidyr or purrr for this question.

library(microbenchmark)

library(microbenchmark)


microbenchmark(m1 = (ucb_admit[rep(1:nrow(ucb_admit), 
                                   ucb_admit$Freq), 
                               -grep("Freq", names(ucb_admit))]),
               m2 = (ucb_admit %>% 
                       slice(rep(1:n(), Freq)) %>% 
                       select(-Freq)),
               m3 = (ucb_admit %>%
                       mutate(Freq = map2(1, Freq, `:`)) %>%
                       unnest() %>%
                       select(-Freq)),
               m4 = (ucb_admit %>%
                       rowwise() %>%
                       mutate(Freq = list(seq(1, Freq))) %>%
                       ungroup() %>%
                       unnest() %>%
                       select(-Freq)),
               m5 = (ucb_admit %>%
                       rowwise() %>%
                       mutate(Freq = paste(seq(1, Freq), collapse = ",")) %>%
                       ungroup() %>%
                       separate_rows(Freq) %>%
                       select(-Freq)))

Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval
   m1  3.455026  3.585888  4.295322  3.845367  4.147506   8.60228   100
   m2  6.888881  7.541269  8.849527  8.031040  9.428189  15.53991   100
   m3 23.252458 24.959122 29.706875 27.414396 32.506805  61.00691   100
   m4 20.033499 21.914645 25.888155 23.611688 27.310155 101.15088   100
   m5 28.972557 31.127297 35.976468 32.652422 37.669135  64.43884   100
2
votes

This doesn't look like a job for gather as the data are aggregated, rather than wide. You can "disaggregate" the data using indexing by repeating the row indices Freq times for each row. Below are methods using base R and dplyr.

library(dplyr)

# Base R
ucb_admit_disagg = ucb_admit[rep(1:nrow(ucb_admit), ucb_admit$Freq), 
                             -grep("Freq", names(ucb_admit))]

# dplyr
ucb_admit_disagg = ucb_admit %>% 
  slice(rep(1:n(), Freq)) %>% 
  select(-Freq)

Here's a portion of the data frame. I've added ellipses to the output to mark breaks in the row sequence.

ucb_admit_disagg[c(1:6, 510:514, 4523:4526), ]
          Admit Gender Dept
1      Admitted   Male    A
1.1    Admitted   Male    A
1.2    Admitted   Male    A
1.3    Admitted   Male    A
1.4    Admitted   Male    A
1.5    Admitted   Male    A
...
1.509  Admitted   Male    A
1.510  Admitted   Male    A
1.511  Admitted   Male    A
2      Rejected   Male    A
2.1    Rejected   Male    A
...
24.313 Rejected Female    F
24.314 Rejected Female    F
24.315 Rejected Female    F
24.316 Rejected Female    F