3
votes

I am trying to create a new variable using ifelse by combining data from two data.frames (similar to this question but without factors).

My problem is that df1 features yearly data, whereas vars in df2 are temporally aggregated: e.g. df1 has multiple obs (1997,1998,...,2005) and df2 only has a range (1900-2001).

For illustration, a 2x2 example would look like

df1$id <- c("2","20")
df1$year <- c("1960","1870")

df2$id <- df1$id
df2$styear <- c("1800","1900")
df2$endyear <- c("2001","1950")

I want to combine both in such a way that the id (same variable exists in both) is matched, and further, the year in df1 is within the range of df2. I tried the following

df1$new.var <- ifelse(df1$id==df2$id & df1$year>=df2$styear & 
df1$year<df2$endyear,1,0)

Which ideally should return 1 and 0, respectively.

But instead I get warning messages:

1: In df1$id == df2$id : longer object length is not a multiple of shorter object length

2: In df1$year >= df2$styear : longer object length is not a multiple of shorter object length

3: In df1$year < df2$endyear : longer object length is not a multiple of shorter object length

For the record, the 'real' df1 has 500 obs and df2 has 14. How can I make this work?

Edit: I realised some obs in df2 are repeated, with multiple periods e.g.

id    styear    endyear
1      1800      1915
1      1950      2002
2      1912      1988
3      1817      2000

So, I believe what I need is something like a double-ifelse:

df1$new.var <- ifelse(df1$id==df2$id & df1$year>=df2$styear & 
df1$year<df2$endyear | df1$year>=df2$styear & 
df1$year<df2$endyear,1,0)

Obviously, this wouldn't work, but it is a way to get out of the duplicates-problem.

For example, if id=1 in df1$year=1801, it will pass the first year-range test (1801 is between 1800-1915), but fail the second one (1801 is not between 1950-2002), so it is only coded once and no extra rows are added (currently the duplicates add extra rows).

4
@Bulat Hello, foverlaps was recommended by others too, I can't seem to get it to work - says "Duplicate columns are not allowed in overlap joins. This may change in the future." - user6550364
can you provide a reproducible example please. - Bulat

4 Answers

1
votes
df1$id <- c("2","20")
df1$year <- c("1960","1870")

df2$id <- df1$id
df2$styear <- c("1800","1900")
df2$endyear <- c("2001","1950")

library(dplyr)
df3 <- left_join(df1,df2,by = "id") %>% filter(year <= endyear,year >= startyear)

I highly recommend the dplyr package for data manipulation.

0
votes

With base R:

df1 <- data.frame(id=c(2,20,22), year=c(1960,1870, 2016))
df2 <- data.frame(id=c(2,20,21), styear=c(1800,1900,2000), endyear=c(2001,1950,2016))

df1
id year
1  2 1960
2 20 1870
3 22 2016

df2

id styear endyear
1  2   1800    2001
2 20   1900    1950
3 21   2000    2016

df1 <- merge(df1, df2, by='id', all.x = TRUE)
df1$new.var <- !is.na(df1$styear) & df1$year>=df1$styear & df1$year< df1$endyear
df1 <- df1[c('id', 'year', 'new.var')]

df1
  id year new.var
1  2 1960    TRUE
2 20 1870   FALSE
3 22 2016   FALSE
0
votes

Alright, I made it work for myself. Beware, it is quite convoluted and probably contain some redundancies. After a brief look at the data wrangling cheatsheet, assuming you have df1 and df2 with an identical var and df2 contains new.var, one can do the following:

library(dplyr)
#Join everything, all values and rows
df3 <- full_join(df1,df2,by="id")
#filter out obs those year is greater than endyear
df3 <- filter(df3,df3$year<=df3$endyear)
#same, the other way around
df3 <- filter(df3,df3$year>=df3$styear) 
df3 <- distinct(df3) #remove duplicate rows (at least I had some)

As far as I can tell by looking at the end result, this method only extracts information from the correct time period while dropping all other time periods in df2. Then, it is a matter of merging with the original data.frame (df1) and filling in the NAs:

df1 <- merge(df1,df3,by=(id),all.x=TRUE)
df1 <- distinct(df1) #just to make sure, I still had three
df1$new.var <- ifelse(is.na(df1$new.var),0,df1$new.var)

which is what I wanted.

0
votes

This can be solved easily and efficiently using non-equi joins in data.table devel version (1.9.7+):

library(data.table)
setDT(df1); setDT(df2) # converting to data.table in place

df1[, new.var := df2[df1, on = .(id, styear <= year, endyear >= year),
                     .N > 0, by = .EACHI]$V1]
df1
#   id year new.var
#1:  2 1960    TRUE
#2: 20 1870   FALSE

The above join looks for matches in df2 for each row of df1 (by = .EACHI), and checks the number of matching rows (.N).