I am analyzing a dataset with specific demographic information. These are the primary variables I am trying to wrangle and subset:

hh_id   is_head_of_household    married   gender   age
1          1                    single    male     28
1          0                    single    female   27
2          1                    married   male     33
2          0                    married   female   34
2          1                    single    male     6

I need to create a variable that indicates the household type, using these four categories: "single male head of household", "single female head of household", "married couple", and "unmarried couple".

Each household has a unique ID. The first household is an unmarried couple because it has at least two adults (18 and over), at least one of them is the head of household (the is_head_of_household flag is coded 1 or 0), and both are listed as "single" in the married column. The second household is a married couple because it has at least two adults, one of them is the head, and both adults are listed as "married" in the married column. A "single male" or "single female" household has AT MOST one adult, male or female respectively, who is also the head of household. Anyone else in the household must be a child (under 18).

I tried creating a column that would indicate one of these four categories for EACH UNIQUE household id using dplyr:

First, I created an adult or child category:

individual_data["adult"] <- NA
individual_data$adult <- ifelse(individual_data$age >= 18, "adult", "child")

This is the code I have so far for trying to create a variable for single households:

individual_data["if_adult"] <- ifelse(individual_data$age >= 18, "1","0")
library(dplyr)
individual_data %>% 
group_by(hh_id) %>% 
mutate(unmarried_couple = sum(if_adult*(married =="Single"))==1,
total_adults = sum(if_adult))  

This code does not yield the expected results and I'm unsure how to go about producing the other two categories. Ideally, my new dataset would look like this:

   hh_id   is_head_of_household    married   gender   age     type
   1          1                    single    male     28  unmarried couple
   1          0                    single    female   27  unmarried couple
   2          1                    married   male     33    married couple
   2          0                    married   female   34    married couple
   2          1                    single    male     6     married couple
   ..
   n          ----------------------------------------------------------

There should only be ONE categorization for EACH hh_id. How can I find a solution to this in dplyr?

data structure:

structure(list(hh_id = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 
5L), person_id = 1:10, is_head_of_household = c(1L, 0L, 1L, 0L, 
0L, 1L, 0L, 1L, 0L, 1L), married = structure(c(2L, 2L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Married", "Single"), class = "factor"), 
gender = structure(c(2L, 5L, 2L, 5L, 5L, 2L, 5L, 2L, 3L, 
2L), .Label = c("F", "Female", "FEMALE", "M", "Male", "MALE"
), class = "factor"), race = structure(c(3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Asian", "Black", "White"
), class = "factor"), age = c(28L, 27L, 34L, 33L, 6L, 28L, 
29L, 30L, 3L, 30L), voted_in_2012 = c(0L, 1L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L), is_college_graduate = c(1L, 1L, 1L, 
0L, 1L, 1L, 0L, 1L, 0L, 1L), adult = c("adult", "adult", 
"adult", "adult", "child", "adult", "adult", "adult", "child", 
"adult")), row.names = c(NA, 10L), class = "data.frame")
Is "head of household" a synonym for "male" here? If so, can it be removed? It does not appear to be imparting any additional information. (It may just be that your data implies that accidentally - can a female be a head of a single or married household?) - halfer
Both females and males can be head of a single or married household. - manutd93
Right, thanks - it looked redundant from the data. - halfer

1 Answer

Edit: converted married to lowercase inside case_when so the match still works when that variable is capitalized differently from the sample data.

library(dplyr)
hh_types <- individual_data %>%
  filter(age >= 18) %>%  # only concerned with adults for categorization
  arrange(hh_id, -is_head_of_household) %>%   # bring head of hh to top
  group_by(hh_id) %>%              # For each hh_id...
  mutate(adult_count = n()) %>%    # ... how many adults
  slice(1) %>%                     # just keep the top row  (the head)
  ungroup() %>%

  mutate(category = case_when(
    tolower(married) == "married"   & adult_count > 1 ~ "married couple",
    tolower(married) == "single" & adult_count > 1 ~ "unmarried couple",
    adult_count == 1   ~ paste("single", tolower(gender), "head of household"),
    TRUE   ~  "Other")) %>%
  select(hh_id, category)
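
The single-household label is built from the gender column, so it inherits whatever spelling that column uses. The structure posted in the question has factor levels "F", "Female", "FEMALE", "M", "Male" and "MALE", and case-folding alone cannot reconcile the abbreviations with the full words. A minimal sketch of normalizing the column before running the pipeline follows; normalize_gender and the demo data frame are hypothetical names made up for illustration, not part of the original data.

```r
library(dplyr)

# Hypothetical helper: map the gender spellings seen in the posted structure
# ("F", "Female", "FEMALE", "M", "Male", "MALE") onto "female" / "male".
normalize_gender <- function(x) {
  x <- tolower(as.character(x))   # as.character() also handles factors
  case_when(
    x %in% c("f", "female") ~ "female",
    x %in% c("m", "male")   ~ "male",
    TRUE                    ~ NA_character_   # anything unexpected becomes NA
  )
}

# Made-up input covering each spelling once (stored as a factor, like the
# posted structure).
demo <- data.frame(
  hh_id  = 1:6,
  gender = c("F", "Female", "FEMALE", "M", "Male", "MALE"),
  stringsAsFactors = TRUE
)

demo <- demo %>%
  mutate(gender = normalize_gender(gender))
```

Applying the same mutate() to individual_data before building hh_types should keep the resulting labels consistent.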


individual_data %>%
  left_join(hh_types, by = "hh_id")   # explicit key, so no "Joining, by" message
#  hh_id is_head_of_household married gender age                        category
#1     1                    1  single   male  28                unmarried couple
#2     1                    0  single female  27                unmarried couple
#3     2                    1 married   male  33                  married couple
#4     2                    0 married female  34                  married couple
#5     2                    1  single   male   6                  married couple
#6     3                    1  single female  30 single female head of household
#7     4                    1  single   male  28   single male head of household
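
One thing to watch for: because hh_types is built after filter(age >= 18), a household with no adults never appears in it, and the left join leaves its category as NA. The sample data has no such household, so this is only a hedged sketch of one way to handle it. The data frames below are made up, hh_lookup stands in for hh_types, and the "no adults" label is an assumption rather than one of the four requested categories.

```r
library(dplyr)

# Made-up rows: household 9 contains only children.
people <- data.frame(
  hh_id = c(1L, 1L, 9L),
  age   = c(30L, 4L, 10L)
)

# Stand-in for the hh_types lookup; household 9 is absent because an
# adults-only filter would have removed all of its rows.
hh_lookup <- data.frame(
  hh_id    = 1L,
  category = "single male head of household",
  stringsAsFactors = FALSE
)

labelled <- people %>%
  left_join(hh_lookup, by = "hh_id") %>%
  # Replace the NA left by the join with an explicit (assumed) label.
  mutate(category = coalesce(category, "no adults"))
```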

Sample data with single-adult households (hh_id 3 and 4) added:

individual_data <- read.table(
  header = TRUE,
  stringsAsFactors = FALSE,
  colClasses = c("integer", "integer", "character", "character", "integer"),
  text = "hh_id   is_head_of_household    married   gender   age
1          1                    single    male     28
1          0                    single    female   27
2          1                    married   male     33
2          0                    married   female   34
2          1                    single    male     6
3          1                    single    female   30
4          1                    single    male     28"
)
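
On the attempt in the question: ifelse(age >= 18, "1", "0") returns character strings, so if_adult * (married == "Single") stops with a "non-numeric argument to binary operator" error. The comparison with "Single" is also case-sensitive, and the result of the pipe is never assigned. Below is a minimal sketch of the per-household counts with a logical adult flag instead. The people data frame simply retypes the five rows from the question, and is_adult and single_adults are names chosen here for illustration.

```r
library(dplyr)

# The five rows shown at the top of the question.
people <- data.frame(
  hh_id   = c(1L, 1L, 2L, 2L, 2L),
  married = c("single", "single", "married", "married", "single"),
  age     = c(28L, 27L, 33L, 34L, 6L),
  stringsAsFactors = FALSE
)

counts <- people %>%
  mutate(is_adult = age >= 18) %>%   # logical, so it can be summed directly
  group_by(hh_id) %>%
  mutate(
    total_adults  = sum(is_adult),
    # tolower() makes the match independent of capitalization
    single_adults = sum(is_adult & tolower(married) == "single")
  ) %>%
  ungroup()
```

With counts like these, a household with two or more adults who are all single would correspond to the "unmarried couple" case.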