How to create a messy_impute() function that imputes NA values in messy data with mean or median?

Question

I have the following data frame for a student with homework and exam scores.

> student1
         UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Homework_6 Homework_7 Homework_8
10 582493224         59         99          88         10         66         90         50         80
   Homework_9 Homework_10 Exam_1 Exam_2 Exam_3 Section
10         16          NA     41     61     11       A

The Homework_10 score is missing, and I need to write a function that imputes the NA value with the mean or the median.

The function messy_impute should have the following arguments:

data : data frame or tibble to be imputed.

center : whether to impute using mean or median.

margin : whether to impute along rows or columns (1 = use the row, 2 = use the column).

For example,

messy_impute(student1,mean,1) should print out 

> student1
         UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Homework_6 Homework_7 Homework_8
10 582493224         59         99          88         10         66         90         50         80
   Homework_9 Homework_10 Exam_1 Exam_2 Exam_3 Section
10         16          **62**    41     61     11       A

since the mean of the other nine homework scores is 62.
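
As a quick check of that figure, the row mean can be reproduced from the scores printed above. This is only a sketch of the arithmetic; the vector name `homework` is made up for illustration.

```r
# Homework_1 to Homework_10 for this student, copied from the printout above
homework <- c(59, 99, 88, 10, 66, 90, 50, 80, 16, NA)

# na.rm = TRUE drops the missing Homework_10 before averaging: 558 / 9
row_mean <- mean(homework, na.rm = TRUE)
row_mean  # 62
```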

And if the mean of Homework_10 across the other students in section A is 50, then

messy_impute(student1,mean,2) should print out 

> student1
         UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Homework_6 Homework_7 Homework_8
10 582493224         59         99          88         10         66         90         50         80
   Homework_9 Homework_10 Exam_1 Exam_2 Exam_3 Section
10         16          **50**    41     61     11       A

since the mean of the Homework_10 column in section A is 50.

Please note that if margin is 2, the calculation should use only students in the same section.

I'm really stuck on defining this function.

hello_friend · Accepted Answer · 2020-08-02T14:59:53

Base R solution:

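# Impute NAs with the row-wise mean of the grade columns.
# Assumes one row per student and an ID column named "studid".
row_wise_mean_impute <- function(df){
  grade_df <- df[, names(df) != "studid", drop = FALSE]
  na_mask <- is.na(grade_df)
  # One mean per row, expanded to the shape of grade_df so that
  # each NA is filled with the mean of its own row
  row_means <- rowMeans(grade_df, na.rm = TRUE)
  fill <- matrix(row_means, nrow = nrow(grade_df), ncol = ncol(grade_df))
  grade_df[na_mask] <- fill[na_mask]
  cbind(df[, "studid", drop = FALSE], grade_df)
}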


# Apply function: 
row_wise_mean_impute(student1)

Data:

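# Simulated scores; the seed is arbitrary and only makes the example reproducible
set.seed(42)
x <- c(rnorm(85, 50, 3), rnorm(15, 50, 15))
# One row: an ID column followed by 100 score columns (X1 ... X100)
student1 <- cbind(studid = 1010101, data.frame(t(x)))
# Blank out the 10th column (X9) so there is a value to impute
student1[, 10] <- NA_real_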
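
The function above covers only the row-wise mean. One possible way to extend the idea to the `center` and `margin` arguments described in the question is sketched below. It is not a definitive implementation, and it rests on assumptions read off the printed data frame: score columns start with `Homework_` or `Exam_`, homework and exams are imputed separately, and the grouping column is called `Section`. The `student1` row is rebuilt from the values shown in the question.

```r
# Sketch only. Assumed (not confirmed by the question): the column prefixes
# "Homework_" / "Exam_" and a grouping column named "Section".
messy_impute <- function(data, center = mean, margin = 1) {
  center <- match.fun(center)  # accepts mean, median, "mean" or "median"
  stopifnot(margin %in% c(1, 2))

  for (prefix in c("Homework_", "Exam_")) {
    cols <- grep(paste0("^", prefix), names(data), value = TRUE)
    if (length(cols) == 0) next

    if (margin == 1) {
      # Row-wise: center of the same student's other scores of this type
      fill <- apply(data[cols], 1, center, na.rm = TRUE)
      for (col in cols) {
        miss <- is.na(data[[col]])
        data[[col]][miss] <- fill[miss]
      }
    } else {
      # Column-wise: center of the same column within the student's section
      for (col in cols) {
        miss <- is.na(data[[col]])
        fill <- ave(data[[col]], data$Section,
                    FUN = function(v) center(v, na.rm = TRUE))
        data[[col]][miss] <- fill[miss]
      }
    }
  }
  data
}

# The student from the question, rebuilt from the printed values
student1 <- data.frame(
  UID = 582493224,
  Homework_1 = 59, Homework_2 = 99, Homework_3 = 88, Homework_4 = 10,
  Homework_5 = 66, Homework_6 = 90, Homework_7 = 50, Homework_8 = 80,
  Homework_9 = 16, Homework_10 = NA_real_,
  Exam_1 = 41, Exam_2 = 61, Exam_3 = 11,
  Section = "A"
)

out <- messy_impute(student1, mean, 1)
out$Homework_10  # 62, the mean of the other nine homework scores
```

With margin = 2 the data frame has to contain the other students as well, because the fill value comes from the same column within each section. A one-row data frame such as `student1` has nothing to average over in that case, and its NA stays missing.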
