How to create a dummy if one variable matches another (by group id) in a time series df?

Question

I have a large data frame that looks like this. I am pasting the dput output of my df below:

df <- structure(list(id = c(32, 32, 32, 32, 32, 32, 32, 12668031110,12668031110, 12668031110), survey_date = structure(c(17167, 17167,17167, 17167, 17167, 17167, 17167, 15034, 15034, 15034), class = "Date"),survey_year = c(2017, 2017, 2017, 2017, 2017, 2017, 2017,2011, 2011, 2011), mom_dob = c(1991, 1991, 1991, 1991, 1991,1991, 1991, 1987, 1987, 1987), date = structure(c(10592,10957, 11323, 11688, 12053, 12418, 12784, 14304, 14669, 15034), class = "Date"), date_year = c(1999, 2000, 2001, 2002,2003, 2004, 2005, 2009, 2010, 2011), mom_age = c(7, 8, 9,10, 11, 12, 13, 21, 22, 23), newborn = c(0, 0, 0, 0, 0, 0,0, 0, 0, 0), stock = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), family500_year = c(1,1, 1, 1, 1, 1, 1, 0, 0, 0), nchild1 = c(2015, 2015, 2015,2015, 2015, 2015, 2015, NA, NA, NA), nchild2 = c(NA, NA,NA, NA, NA, NA, NA, 2010, 2010, 2010), nchild3 = c(NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, NA_real_, NA_real_), nchild4 = c(NA_real_, NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, NA_real_), nchild5 = c(NA_real_, NA_real_, NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,NA_real_), nchild6 = c(NA_real_, NA_real_, NA_real_, NA_real_,NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), nchild7 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), nchild8 = c(NA,NA, NA, NA, NA, NA, NA, NA, NA, NA), nchild9 = c(NA, NA,NA, NA, NA, NA, NA, NA, NA, NA), nchild10 = c(NA, NA, NA,NA, NA, NA, NA, NA, NA, NA), educcat = c(2, 2, 2, 2, 2, 2,2, 3, 3, 3), educcat_college = c(0, 0, 0, 0, 0, 0, 0, 1,1, 1), hh_income_net = c(3410, 3410, 3410, 3410, 3410, 3410,3410, 7978.7001953125, 7978.7001953125, 7978.7001953125),hh_income_annual_usd = c(10912, 10912, 10912, 10912, 10912,10912, 10912, 25531.840625, 25531.840625, 25531.840625),hh_income_annual_log = c(9.29761838008324, 9.29761838008324,9.29761838008324, 9.29761838008324, 9.29761838008324, 9.29761838008324,9.29761838008324, 10.1476816041898, 10.1476816041898, 10.1476816041898), marital_stat = c(10, 10, 10, 10, 10, 10, 10, 20, 20, 20), maritalcat = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1), rural = c(1,1, 1, 1, 1, 1, 1, 0, 0, 0), age_sq = c(625, 625, 625, 625,625, 625, 625, 529, 529, 529), emp_stat = c(3, 3, 3, 3, 3,3, 3, 6, 6, 6), occupation = c("98", "98", "98", "98", "98","98", "98", "48", "48", "48"), disability_stat = c(2, 2,2, 2, 2, 2, 2, 2, 2, 2), weight = c(1039, 1039, 1039, 1039,1039, 1039, 1039, 1457, 1457, 1457), region = c(2, 2, 2,2, 2, 2, 2, 12, 12, 12), birth_country = c(1, 1, 1, 1, 1,1, 1, 1, 1, 1), birth_citizenship = c(1, 1, 1, 1, 1, 1, 1,1, 1, 1)), row.names = c(NA, -10L), class = c("tbl_df", "tbl","data.frame"))

It's a df containing the fertility history of each woman. There are 18 rows for each woman because I'm reconstructing her births from the past 18 years.

Now, what I'm trying to do is to generate a dummy variable for newborn, where the value of nchild1, nchild2, nchild3, etc matches date_year. For example, for woman id #62, if nchild1 is 2004, I want newborn to be 1 in the same row as date_year=2004 AND 1 in the same row as date_year=2005 (and the rest to be 0s). Relatedly, I want stock (i.e. the number of children) to turn to 1 the year the first child is born (in this case, 2004) and remain 1 until the last year observed (2017).

## example df of what I have now: 
id   date_year  newborn   stock   nchild1  nchild2  nchchild3   
62       1996      0        2      2004      2005       NA     
62       1997      0        2      2004      2005       NA     
62       1998      0        2      2004      2005       NA      
62       1999      0        2      2004      2005       NA     
62       2000      0        2      2004      2005       NA     
62       2001      0        2      2004      2005       NA     
62       2002      0        2      2004      2005       NA      
62       2003      0        2      2004      2005       NA       
62       2004      0        2      2004      2005       NA     
62       2005      0        2      2004      2005       NA     
62       2006      0        2      2004      2005       NA     
62       2007      0        2      2004      2005       NA     
62       2008      0        2      2004      2005       NA     
62       2009      0        2      2004      2005       NA     
62       2010      0        2      2004      2005       NA      
62       2011      0        2      2004      2005       NA
62       2012      0        2      2004      2005       NA     
62       2013      0        2      2004      2005       NA
62       2014      0        2      2004      2005       NA

## desired df
id   date_year  newborn   stock   nchild1  nchild2  nchchild3   
62       1996      0        0      2004      2005       NA     
62       1997      0        0      2004      2005       NA     
62       1998      0        0      2004      2005       NA      
62       1999      0        0      2004      2005       NA     
62       2000      0        0      2004      2005       NA     
62       2001      0        0      2004      2005       NA     
62       2002      0        0      2004      2005       NA      
62       2003      0        0      2004      2005       NA       
62       2004      1        1      2004      2005       NA     
62       2005      1        2      2004      2005       NA     
62       2006      0        2      2004      2005       NA     
62       2007      0        2      2004      2005       NA     
62       2008      0        2      2004      2005       NA     
62       2009      0        2      2004      2005       NA     
62       2010      0        2      2004      2005       NA      
62       2011      0        2      2004      2005       NA
62       2012      0        2      2004      2005       NA     
62       2013      0        2      2004      2005       NA
62       2014      0        2      2004      2005       NA

So far I've tried to write code using ifelse, but I'm having trouble figuring out how to denote the condition that date_year==nchild1 | nchild2 | nchild3, etc.

df$newborn <- ifelse(df$newborn == 1, df$nchild1==df$date_year, df$newborn)

Any and all help would be appreciated, thanks so much.

Edited to make sure the dput matches the actual df.

UPDATED SOLUTION:

  df <- df %>% 
    rowwise() %>%
    mutate(newborn = ifelse(date_year %in% c(nchild1, nchild2, nchild3, nchild4, nchild5, nchild6, nchild7, nchild8, nchild9, nchild10), 1, 0)) %>%
    group_by(id) %>%
    mutate(stock = cumsum(newborn))

Updated solution: Polish_panel_hist_06 %>% rowwise() %>% mutate(newborn = ifelse(date_year %in% c(nchild1, nchild2, nchild3), 1, 0)) %>% group_by(id) %>% mutate(stock = cumsum(newborn)) — Anna

cg1979 cg1979 · Accepted Answer · 2021-11-22T07:58:33

The data you posted doesn't exactly contain your output

Polish_panel_hist_06 %>% 
select(id,date_year,newborn,stock,nchild1,nchild2,nchild3) %>% 
filter(id == 32)

# A tibble: 7 x 7
 id date_year newborn stock nchild1 nchild2 nchild3
 <dbl>     <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl>
1    32      1999       0     1    2015      NA      NA
2    32      2000       0     1    2015      NA      NA
3    32      2001       0     1    2015      NA      NA
4    32      2002       0     1    2015      NA      NA
5    32      2003       0     1    2015      NA      NA
6    32      2004       0     1    2015      NA      NA
7    32      2005       0     1    2015      NA      NA

but i think this could be what you are looking for

library('dplyr')
new <-Polish_panel_hist_06 %>% 
  mutate(newborn = ifelse(date_year %in% c(nchild1,nchild2,nchild3),1,0),
  stock = cumsum(newborn))

How to create a dummy if one variable matches another (by group id) in a time series df?

1 Answers