Create indicator variable for NA values in a dataframe using index values for larger dataset

Question

I have a dataframe with around 300 features and 1 million observations. I have created a list variable that holds the index values of the columns in which 80% of the data are NA values.

My index list has -> 2, 4. I want to create an indicator variable for the columns at index 2 and 4 of the dataframe, replacing NA values with "0" and all other values in those columns with "1".

I tried looping through each row, but because the data is huge, the loop takes a long time.

Input dataframe -> df

row col1 col2 col3
a NA 1 3
a NA 1 NA
a 2 2 NA

Expected output:
row col1 col2 col3
a 0 1 1
a 0 1 0
a 1 2 0

Can anyone point me in the right direction to achieve this faster?

Thanks,
Renuka

Paul Campbell · Accepted Answer · 2018-07-11T12:31:18

You can use dplyr::mutate_at to select the columns you want to change, then apply case_when to recode NAs as 0 and anything else as 1. Both functions work on whole columns at once, so this should be a lot quicker than a for loop.

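library(dplyr)

# Recode only col1 and col3; assign the result back to keep it.
# Note: funs() is deprecated as of dplyr 0.8.0; newer versions accept
# a ~ formula in its place.
df %>%
  mutate_at(vars(col1, col3), funs(
    case_when(
      is.na(.) ~ 0,  # NA becomes 0
      TRUE ~ 1       # every other value becomes 1
    )
  ))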
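
Since the question starts from a list of column index positions rather than column names, the same recoding can also be sketched in base R. This is only an illustration under stated assumptions: the toy `df` is rebuilt from the question's example, and `idx` is a hypothetical name for the index list (2 and 4).

```r
# Toy data mirroring the example in the question.
df <- data.frame(
  row  = c("a", "a", "a"),
  col1 = c(NA, NA, 2),
  col2 = c(1, 1, 2),
  col3 = c(3, NA, NA)
)

# Column positions to recode; `idx` is an illustrative name for the index list.
idx <- c(2, 4)

# is.na() works on a whole column at once, so no row loop is needed:
# NA becomes 0, any other value becomes 1.
df[idx] <- lapply(df[idx], function(x) as.integer(!is.na(x)))

print(df)
```

Only the columns listed in `idx` are touched, so col2 keeps its original values, as in the expected output.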
