Replace NA's in set of columns by values from another set of columns using dplyr

Question

I found some related questions to mine that helped some, but all where different in a key part, so here it goes.

I have a data frame with some NA's:

type <- LETTERS[1:5]
a_pc <- c(3, NA, NA , 4, 5)
b_pc <- c(NA, 2, 7, 4, 5)
a_pc_mean <- rep(mean(a_pc, na.rm = TRUE), times = 5)
b_pc_mean <- rep(mean(b_pc, na.rm = TRUE), times = 5)

df <- data.frame(type, a_pc, b_pc, a_pc_mean, b_pc_mean)

> df
  type a_pc b_pc a_pc_mean b_pc_mean
1    A    3   NA         4       4.5
2    B   NA    2         4       4.5
3    C   NA    7         4       4.5
4    D    4    4         4       4.5
5    E    5    5         4       4.5

I want to replace the NA's in the columns a_pc and b_pc with the values in their respective mean columns. I thought a clean way to do it was to use dplyr. My code so far is:

library(dplyr)

df2 <- df %>%
  mutate_at(.vars = vars(ends_with("_pc")),
            .funs = funs(replace(., is.na(.), ???)

Where I put the question marks I need to reference to the columns with the means, but I cannot figure out what. My understanding of dplyr is that the . references the columns in vars(ends_with("_pc")) so I tried to paste0 together . and "_mean", but that didn't work. This question came close to mine, but it asked to replace by a fixed value, not a value from anther column.

My actual dataset has more then two columns in which I want to replace NA's , so I'd prefer not to reference them explicitly.

EDIT

My original question above didn't illustrate what I wanted to do, so to clarify I post a sample of my data:

 > crime_pop
   subregion                 iso    year assault kidnapping      pop assault_pc kidnapping_pc
   <fct>                     <chr> <dbl>   <dbl>      <dbl>    <dbl>      <dbl>         <dbl>
 1 Caribbean                 ABW    2008      NA         NA   101353 NA           NA         
 2 Southern Asia             AFG    2008      NA         NA 27294031 NA           NA         
 3 Middle Africa             AGO    2008      NA         NA 21759420 NA           NA         
 4 Southern Europe           ALB    2008     363         10  2947314  0.000123     0.00000339
 5 Southern Europe           AND    2008     105          0    83861  0.00125      0         
 6 Western Asia              ARE    2008     631        672  6894278  0.0000915    0.0000975 
 7 South America             ARG    2008  145240         NA 40382389  0.00360     NA         
 8 Western Asia              ARM    2008     201         27  2908220  0.0000691    0.00000928
 9 Caribbean                 ATG    2008      NA         NA    92478 NA           NA         
10 Australia and New Zealand AUS    2008   68019        611 21249200  0.00320      0.0000288

My idea was to interpolate the NA's in assault and kidnapping (and the other variables in the actual dataset) by calculating the per capita crime rates of the countries without missing data, taking the sub-region averages of these and applying these to the countries with the missing data.

To calculate the per capita crime rates I used:

crime_pop <- crime_pop %>%
  mutate_at(.vars = vars(assault:kidnapping),
            .funs = funs(pc = . / pop))

The sub-region means can than be calculated using @Psidom 's answer:

crime_pop2 <- crime_pop %>%
  group_by(year, subregion) %>%
  mutate_at(vars(ends_with("_pc")),
            funs(replace(., is.na(.), mean(., na.rm = TRUE))))

Now the NA's in assault and kidnapping need the be replaced by the product of pop and assault_pc, and pop and kidnapping_pc respectively, which brings me back to my original question of referencing other columns in the replace function when used in mutate_at. Maybe there is an easier way to do all this in one go, I'm open to suggestions. Thanks!

Will write up a full answer when I get a chance, but you can't use mutate_at like that as far as I can tell. vars refers to the columns to apply a function to, but you basically want to apply a different function to each column. I suggest you look into map2 to map over the pc and the mean columns in parallel. — Calum You
possibly you have a small error in your code: a_pc_mean <- rep(mean(a, na.rm = TRUE), times = 5) should be a_pc_mean <- rep(mean(a_pc, na.rm = TRUE), times = 5). Similarly for b_pc_mean. — rgt47
@rgt47 Yes you are right. I changed the variable names to better reflect my actual data, but I forgot to change them in the mean call. — avs

Psidom Psidom · Accepted Answer · 2018-07-14T22:29:42

Simply use mean(., na.rm=TRUE) as the replacement:

df %>% mutate_at(vars(ends_with('_pc')), funs(replace(., is.na(.), mean(., na.rm=TRUE))))

#  type a_pc b_pc a_pc_mean b_pc_mean
#1    A    3  4.5         4       4.5
#2    B    4  2.0         4       4.5
#3    C    4  7.0         4       4.5
#4    D    4  4.0         4       4.5
#5    E    5  5.0         4       4.5

Or you can use coalesce that does the same thing, i.e. if values from . is NA, replace it with the mean:

df %>% mutate_at(vars(ends_with('_pc')), funs(coalesce(., mean(., na.rm=TRUE))))

#  type a_pc b_pc a_pc_mean b_pc_mean
#1    A    3  4.5         4       4.5
#2    B    4  2.0         4       4.5
#3    C    4  7.0         4       4.5
#4    D    4  4.0         4       4.5
#5    E    5  5.0         4       4.5

Replace NA's in set of columns by values from another set of columns using dplyr

2 Answers