I found some related questions to mine that helped some, but all where different in a key part, so here it goes.
I have a data frame with some NA's:
type <- LETTERS[1:5]
a_pc <- c(3, NA, NA , 4, 5)
b_pc <- c(NA, 2, 7, 4, 5)
a_pc_mean <- rep(mean(a_pc, na.rm = TRUE), times = 5)
b_pc_mean <- rep(mean(b_pc, na.rm = TRUE), times = 5)
df <- data.frame(type, a_pc, b_pc, a_pc_mean, b_pc_mean)
> df
type a_pc b_pc a_pc_mean b_pc_mean
1 A 3 NA 4 4.5
2 B NA 2 4 4.5
3 C NA 7 4 4.5
4 D 4 4 4 4.5
5 E 5 5 4 4.5
I want to replace the NA's in the columns a_pc
and b_pc
with the values in their respective mean columns. I thought a clean way to do it was to use dplyr. My code so far is:
library(dplyr)
df2 <- df %>%
mutate_at(.vars = vars(ends_with("_pc")),
.funs = funs(replace(., is.na(.), ???)
Where I put the question marks I need to reference to the columns with the means, but I cannot figure out what. My understanding of dplyr is that the .
references the columns in vars(ends_with("_pc"))
so I tried to paste0 together .
and "_mean"
, but that didn't work. This question came close to mine, but it asked to replace by a fixed value, not a value from anther column.
My actual dataset has more then two columns in which I want to replace NA's , so I'd prefer not to reference them explicitly.
EDIT
My original question above didn't illustrate what I wanted to do, so to clarify I post a sample of my data:
> crime_pop
subregion iso year assault kidnapping pop assault_pc kidnapping_pc
<fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Caribbean ABW 2008 NA NA 101353 NA NA
2 Southern Asia AFG 2008 NA NA 27294031 NA NA
3 Middle Africa AGO 2008 NA NA 21759420 NA NA
4 Southern Europe ALB 2008 363 10 2947314 0.000123 0.00000339
5 Southern Europe AND 2008 105 0 83861 0.00125 0
6 Western Asia ARE 2008 631 672 6894278 0.0000915 0.0000975
7 South America ARG 2008 145240 NA 40382389 0.00360 NA
8 Western Asia ARM 2008 201 27 2908220 0.0000691 0.00000928
9 Caribbean ATG 2008 NA NA 92478 NA NA
10 Australia and New Zealand AUS 2008 68019 611 21249200 0.00320 0.0000288
My idea was to interpolate the NA's in assault and kidnapping (and the other variables in the actual dataset) by calculating the per capita crime rates of the countries without missing data, taking the sub-region averages of these and applying these to the countries with the missing data.
To calculate the per capita crime rates I used:
crime_pop <- crime_pop %>%
mutate_at(.vars = vars(assault:kidnapping),
.funs = funs(pc = . / pop))
The sub-region means can than be calculated using @Psidom 's answer:
crime_pop2 <- crime_pop %>%
group_by(year, subregion) %>%
mutate_at(vars(ends_with("_pc")),
funs(replace(., is.na(.), mean(., na.rm = TRUE))))
Now the NA's in assault
and kidnapping
need the be replaced by the product of pop
and assault_pc
, and pop
and kidnapping_pc
respectively, which brings me back to my original question of referencing other columns in the replace function when used in mutate_at
. Maybe there is an easier way to do all this in one go, I'm open to suggestions. Thanks!
mutate_at
like that as far as I can tell.vars
refers to the columns to apply a function to, but you basically want to apply a different function to each column. I suggest you look intomap2
to map over thepc
and themean
columns in parallel. – Calum Youa_pc_mean <- rep(mean(a, na.rm = TRUE), times = 5)
should bea_pc_mean <- rep(mean(a_pc, na.rm = TRUE)
, times = 5). Similarly forb_pc_mean
. – rgt47mean
call. – avs