Factor Variable Labelling but proportionally

Question

I am organizing a data set and have a problem with a factor variable. I have a gender variable with a total count of 3246, and most of the observations are male. There are 50 NAs in the gender category. I do not want to delete the observations with NAs, but I also do not want to replace all of them with 'male' or all of them with 'female'. I would like to randomly change 7 of the NAs to 'female' and 43 to 'male'. However, I could not manage it.

I already know how to change all NAs to a single value:

data$Gender[is.na(data$Gender)] = 'male'

jay.sf · Accepted Answer · 2021-09-06T11:36:35

You can use is.na() to build a logical index na that marks the missing entries, draw a sample from the set of observed values whose length equals the number of TRUEs in na, and assign that sample to the missing positions. Here is an example:

## example data
n <- 1e3
set.seed(42)
x <- sample(c('f', 'm'), n, replace=TRUE)
x[sample(length(x), 50)] <- NA
table(x, useNA="ifany")
# x
#     f    m <NA> 
#   476  474   50 

## solution 1
u <- unique(na.omit(x))  ## value universe  
na <- is.na(x)  ## subset variable
x[na] <- sample(u, sum(na), replace=TRUE)  ## new sample
table(x, useNA="ifany")  ## result
# x
#   f   m 
# 504 496

You can also use the proportions of the non-missing data as sampling probabilities, so that females and males are drawn in roughly the observed ratio:

## solution 2 (Note: Create example data again from above)
p <- proportions(table(x))  ## proportions
na <- is.na(x)  ## subset variable
x[na] <- sample(names(p), sum(na), replace=TRUE, prob=p)  ## new sample
table(x, useNA="ifany")  ## result
# x
#   f   m 
# 500 500
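
Both solutions above draw each replacement independently, so the split among the filled values varies from run to run. If the counts have to be exactly 7 and 43, as asked in the question, one option is to build the replacement values with fixed counts and only randomize their positions. This is a minimal sketch that assumes the example data from above and the labels 'f' and 'm'; the names before and fill are only illustrative.

```r
## example data (same as above)
n <- 1e3
set.seed(42)
x <- sample(c('f', 'm'), n, replace=TRUE)
x[sample(length(x), 50)] <- NA

## solution 3 (sketch): fixed counts, random positions
na <- is.na(x)  ## subset variable
before <- table(x)  ## counts before filling (NA's not counted)
fill <- rep(c('f', 'm'), times=c(7, 43))  ## exactly 7 'f' and 43 'm'
stopifnot(length(fill) == sum(na))  ## counts must add up to the number of NA's
x[na] <- sample(fill)  ## shuffle the fill values, then assign
table(x) - before  ## gains per level: f 7, m 43
```

The same assignment should work on a factor column such as data$Gender, provided the fill values are spelled exactly like the existing levels (for example 'female' and 'male'); values that are not levels would turn into NA with a warning.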
