1
votes

I have a data set like this:

Age <- rnorm(n=100, mean=20, sd=5)
ind <- which(Age %in% sample(Age, 50))
Age[ind]<-NA

Age2 <- rnorm(n=100, mean=20, sd=5)
ing <- which(Age2 %in% sample(Age2, 50))
Age2[ing]<-NA

Age3 <- rnorm(n=100, mean=20, sd=5)
int <- which(Age3 %in% sample(Age3, 50))
Age3[int]<-NA


data<-data.frame(Age,Age2,Age3)

Its an old data set several different people put together where multiple columns mean the same thing (there are several columns for age in the real data set). As you can see, there are quite a few NA's. I'd like to create a unified "age" column. To do this, I'd like to ideally use the number from the first age column, but if that is NA I'd then preferentially use the number from Age2, and if it is also NA i'd use Age3, and I'd like to do so in that order (Age3 would never supersede Age2, etc...) as I trust the people who input the data in that order haha.

I'm aware of other answers on here for filling columns based on several conditions, like so: dplyr replacing na values in a column based on multiple conditions But I'm not sure how to place priorities. Thank you!

2

2 Answers

3
votes

You can use coalesce() from dplyr which will fill based on the first non-missing value from left to right.

library(dplyr)

df <-data.frame(Age,Age2,Age3)
df$new_age <- coalesce(!!!df)

head(df)

       Age     Age2     Age3  new_age
1 17.19762       NA       NA 17.19762
2 18.84911 21.17693       NA 18.84911
3 27.79354       NA       NA 27.79354
4       NA 15.19072       NA 15.19072
5       NA       NA 27.99254 27.99254
6 28.57532       NA 19.55717 28.57532
1
votes

A base R possibility could be:

apply(data, 1, function(x) x[which(!is.na(x))[1]])