I am analyzing a dataset with specific demographic information. These are the primary variables I am trying to wrangle and subset:
hh_id is_head_of_household married gender age
1 1 single male 28
1 0 single female 27
2 1 married male 33
2 0 married female 34
2 1 single male 6
I need to create a variable that indicates the type of household, using these four categories: "single male head of household", "single female household", "married couple", and "unmarried couple".
Each household has a unique ID. For example, the first household is an unmarried couple because it has at least two adults (18 and over), one of them is the head of household (is_head_of_household is 1), and both are listed as "single" in the married column. The second household is a married couple because it has at least two adults, one of them is the head, and both are listed as "married" in the married column. A "single male" or "single female" household has AT MOST one adult (male or female, respectively), who is also the head of household. Any other individuals in the household must be children (under 18).
I tried creating a column that would indicate one of these four categories for EACH UNIQUE household id using dplyr:
First, I created an adult or child category:
# Flag each person as an adult (18 and over) or a child.
individual_data$adult <- ifelse(individual_data$age >= 18, "adult", "child")
This is the code I have so far for trying to flag households whose adults are single (unmarried couples):
# Store the flag as a number so it can be summed and multiplied (the strings "1"/"0" cannot).
individual_data["if_adult"] <- as.integer(individual_data$age >= 18)
library(dplyr)
individual_data %>%
  group_by(hh_id) %>%
  mutate(unmarried_couple = sum(if_adult * (married == "Single")) == 1,
         total_adults = sum(if_adult))
This code does not yield the expected results and I'm unsure how to go about producing the other two categories. Ideally, my new dataset would look like this:
hh_id is_head_of_household married gender age type
1 1 single male 28 unmarried couple
1 0 single female 27 unmarried couple
2 1 married male 33 married couple
2 0 married female 34 married couple
2 1 single male 6 married couple
..
n ----------------------------------------------------------
There should only be ONE categorization for EACH hh_id. How can I find a solution to this in dplyr?
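A minimal sketch of one way the rules described above could be expressed in dplyr, offered as an illustration rather than a definitive solution. It makes several assumptions that are not stated in the question: the helper columns (is_adult, sex, n_adults, n_married_adults, has_adult_head, head_sex) are hypothetical names; the inconsistent gender labels are reduced to their first letter; a couple counts as "married" only when at least two adults are listed as married, and as "unmarried" only when no adult is; any household that fits none of the four rules gets NA. The sample rows are copied from the dput() output listed under "data structure".

```r
library(dplyr)

# Sample rows copied from the dput() output in the question (subset of columns).
individual_data <- data.frame(
  hh_id = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 5L),
  is_head_of_household = c(1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L),
  married = c("Single", "Single", "Married", "Married", "Single",
              "Single", "Single", "Single", "Single", "Single"),
  gender = c("Female", "Male", "Female", "Male", "Male",
             "Female", "Male", "Female", "FEMALE", "Female"),
  age = c(28L, 27L, 34L, 33L, 6L, 28L, 29L, 30L, 3L, 30L),
  stringsAsFactors = FALSE
)

typed <- individual_data %>%
  mutate(
    is_adult = age >= 18,
    # Assumption: labels such as "F", "Female", "FEMALE" all mean the same
    # thing, so only the upper-cased first letter is kept.
    sex = toupper(substr(as.character(gender), 1, 1))
  ) %>%
  group_by(hh_id) %>%
  mutate(
    # Household-level summaries, repeated on every row of the household.
    n_adults = sum(is_adult),
    n_married_adults = sum(is_adult & married == "Married"),
    has_adult_head = any(is_adult & is_head_of_household == 1),
    # Sex of the first adult head; NA if the household has no adult head.
    head_sex = first(sex[is_adult & is_head_of_household == 1]),
    type = case_when(
      n_adults >= 2 & has_adult_head & n_married_adults >= 2 ~ "married couple",
      n_adults >= 2 & has_adult_head & n_married_adults == 0 ~ "unmarried couple",
      n_adults == 1 & head_sex == "M" ~ "single male head of household",
      n_adults == 1 & head_sex == "F" ~ "single female household",
      TRUE ~ NA_character_  # anything that fits none of the four rules
    )
  ) %>%
  ungroup() %>%
  select(-c(is_adult, sex, n_adults, n_married_adults, has_adult_head, head_sex))

# One row per household, to check that each hh_id has a single type.
distinct(typed, hh_id, type)
```

Because every summary is computed inside group_by(hh_id), each row of a household receives the same type, so there is only one categorization per hh_id. The order of the case_when() conditions matters: the couple rules are checked before the single-adult rules.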
data structure:
structure(list(hh_id = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L,
5L), person_id = 1:10, is_head_of_household = c(1L, 0L, 1L, 0L,
0L, 1L, 0L, 1L, 0L, 1L), married = structure(c(2L, 2L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Married", "Single"), class = "factor"),
gender = structure(c(2L, 5L, 2L, 5L, 5L, 2L, 5L, 2L, 3L,
2L), .Label = c("F", "Female", "FEMALE", "M", "Male", "MALE"
), class = "factor"), race = structure(c(3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Asian", "Black", "White"
), class = "factor"), age = c(28L, 27L, 34L, 33L, 6L, 28L,
29L, 30L, 3L, 30L), voted_in_2012 = c(0L, 1L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 1L), is_college_graduate = c(1L, 1L, 1L,
0L, 1L, 1L, 0L, 1L, 0L, 1L), adult = c("adult", "adult",
"adult", "adult", "child", "adult", "adult", "adult", "child",
"adult")), row.names = c(NA, 10L), class = "data.frame")