I've been working on a lot of new projects where I'm doing lots of unfamiliar data prep and management.
I have two data frames: 1) that is very large with thousands of observations and variables (df1), and 2) a data frame that lists ranges of collection years for a subset of variables in df1 (df2). I need to create a new variable in df1 for a large subset of variables/columns in df1. The new variables created for df1 will check if a value is present (1), a value isn't present for a year that was collected (0), or a value isn't present and the year falls outside of the collection range listed in df2 ('NA').
I've spent a couple days reading a ton of lapply()
but I can't seem to find one that addresses my needs or deals with type of complexity such that I don't have to do this with brute force.
Here is my workable starting data frame:
grp <- c('a', 'a', 'a', 'b', 'b')
year <- c(1991, 1992, 1993, 2005, 2010)
v1 <- c(20.5, 30.5, 29.6, 28.7, 26.1)
v2 <- c(100.0, 101.5, 105.1, 'NA', 95.0)
v3 <- c(47.2, 'NA', 'NA', 'NA', 'NA')
df1 <- data.frame(grp = grp, year = year, v1 = v1, v2 = v2, v3 = v3)
df1
grp year v1 v2 v3
a 1991 20.5 100 47.2
a 1992 30.5 101.5 NA
a 1993 29.6 105.1 NA
b 2005 28.7 NA NA
b 2010 26.1 95 NA
Here is my reference data frame with coverages for variables in df1:
vars <- c('v1', 'v2', 'v3')
start <- c(1989, 2004, 1980)
end <- c(2015, 2011, 1994)
df2 <- data.frame(vars = vars, start = start, end = end)
df2
vars start end
v1 1989 2015
v2 2004 2011
v3 1980 1994
I've been learning with simple stuff with 'lapply()' like:
test <- df1[paste0(vars, '.cov')] <- lapply(df1[vars], function(x) as.integer(x > 0))
I wrote out in R, what I think are, the types of conditions that need to be met. I'll narrate with written English:
a value is present for a year that was collected (1)
if (!is.na(x)) { x <- 1 }
a value isn't present for a year that falls within the range listed in df2 (0)
if (is.na(x) & year %in% seq(df2$start[df2$vars == names(df1[x]), ], df2$end[df2$vars == names(df1[x]), ], 1)) { x <- 0 }
a value isn't present and the year falls outside of the collection range listed in df2 ('NA')
if (is.na(x) & !(year %in% seq(df2$start[df2$vars == names(df1[x]), ], df2$end[df2$vars == names(df1[x]), ], 1))) { x <- 'NA' }
I did my best with the syntax and indexing, but we're rapidly getting out of my comfort zone.
After running the conditional checks the desired output/modified df1 should look like:
grp year v1 v2 v3 v1.cov v2.cov v3.cov
a 1991 20.5 100 47.2 1 1 1
a 1992 30.5 101.5 NA 1 1 0
a 1993 29.6 105.1 NA 1 1 0
b 2005 28.7 NA NA 1 0 NA
b 2010 26.1 95 NA 1 1 NA
I'm open to a variety of solutions but this seemed the likely path to move through. Thanks again for all the help. I'm an experienced R modeler/scientist but I've learned so much data prep, 'data.table', and 'dplyr' in the past month with all your help.
NA
, not'NA'
, which is just a string. - Frank