R reconstruct data

Question

I want to reshape my data but couldn't figure out how because it's not exactly from wide to long.

Here's a sample of my data, population by single year of age and by sex

data <- structure(list(`2010 - Both Sexes - 0` = 163753, `2010 - Male - 0` = 83878, `2010 - Female - 0` = 79875, `2011 - Both Sexes - 0` = 161923, 
                       `2011 - Male - 0` = 83134, `2011 - Female - 0` = 78789, `2010 - Both Sexes - 1` = 163043, 
                       `2010 - Male - 1` = 83174, `2010 - Female - 1` = 79869, `2011 - Both Sexes - 1` = 163342, 
                       `2011 - Male - 1` = 83472, `2011 - Female - 1` = 79870), row.names = c(NA, 
                         -1L), class = c("tbl_df", "tbl", "data.frame"))

My desired dataset would look like the following:

 age  2010 - Both Sexes   2010 - Male   2010 - Female  2011 - Both Sexes 2011 - Male   2011 - Female ...
  0                                     
  1

                 ...

Can anyone help? Thanks.

1. Where does age come from? Not all columns have - 0 or - 1. 2. That's looks like an awful (untidy) output format. Is that really want you want? — Maurits Evers
@MauritsEvers age would be a new column, columns that don't have 0 or 1 are the total population. I just deleted them. — Eric Wang
I would really urge you to reconsider the format. This is the opposite of tidy. That aside, you've got a solution below. — Maurits Evers
Fair enough. BTW, the definition of "tidy data" is pretty unambiguous and defined in Tidy Data by Hadley Wickham, Journal of Statistical Software 59 (2014). — Maurits Evers
@MauritsEvers thanks for sharing this. Good to know that someone has defined it. — Eric Wang

Calum You Calum You · Accepted Answer · 2018-08-01T22:17:39

I don't know why you would ever want this format, but you can do it with tidyr like so:

gather all the columns into colnames and populations;
separate out the age from the other two by splitting on the second -;
spread back out the population values.

library(tidyverse)
data <- structure(list(`2010 - Both Sexes - 0` = 163753, `2010 - Male - 0` = 83878, `2010 - Female - 0` = 79875, `2011 - Both Sexes - 0` = 161923, `2011 - Male - 0` = 83134, `2011 - Female - 0` = 78789, `2010 - Both Sexes - 1` = 163043, `2010 - Male - 1` = 83174, `2010 - Female - 1` = 79869, `2011 - Both Sexes - 1` = 163342, `2011 - Male - 1` = 83472, `2011 - Female - 1` = 79870), row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))

data %>%
  gather(year_sex_age, population) %>%
  separate(year_sex_age, c("year_sex", "age"), sep = " - (?=0|1)") %>%
  spread(year_sex, population)
#> # A tibble: 2 x 7
#>   age   `2010 - Both Se~ `2010 - Female` `2010 - Male` `2011 - Both Se~
#>   <chr>            <dbl>           <dbl>         <dbl>            <dbl>
#> 1 0               163753           79875         83878           161923
#> 2 1               163043           79869         83174           163342
#> # ... with 2 more variables: `2011 - Female` <dbl>, `2011 - Male` <dbl>

Created on 2018-08-01 by the reprex package (v0.2.0).

R reconstruct data

1 Answers