I'm having some trouble with advanced operations in dplyr with grouped data. I'm not sure how to specify if I want to refer to an observation-level value, and when I can specifically refer to the entire vector.
Sample data frame:
df <- as.data.frame(
rbind(
c(11990, 2011, 1, 1, 2010),
c(11990, 2015, 1, 0, NA),
c(11990, 2017, 2, 1, NA),
c(11990, 2018, 2, 1, 2016),
c(11990, 2019, 2, 1, 2019),
c(11990, 2020, 1, 0, NA),
c(22880, 2013, 1, 1, NA),
c(22880, 2014, 1, 0, 2011),
c(22880, 2015, 1, 1, NA),
c(22880, 2018, 2, 0, 2014),
c(22880, 2020, 2, 0, 1979)))
names(df) <- c("id", "year", "house_apt", "moved", "year_moved")
# > df
# id year house_apt moved year_moved
# 1 11990 2011 1 1 2010
# 2 11990 2015 1 0 NA
# 3 11990 2017 2 1 NA
# 4 11990 2018 2 1 2016
# 5 11990 2019 2 1 2019
# 6 11990 2020 1 0 NA
# 7 22880 2013 1 1 NA
# 8 22880 2014 1 0 2011
# 9 22880 2015 1 1 NA
# 10 22880 2018 2 0 2014
# 11 22880 2020 2 0 1979
If I do simple mutate operations:
library(dplyr)
df %>% mutate(year+2)
df %>% group_by(id) %>% mutate(year+2)
It's pretty obvious that "year" here refers to each individual row value. This is the case even if I were to (for some reason) do it with a grouping. However, if I were to do the following two operations which involve a vector operation:
df %>% mutate(sum(year))
df %>% group_by(id) %>% mutate(sum(year))
dplyr understands "year" as the entire vector of year values for that whole group.
However, now I am having a lot of trouble with an operation where it is ambiguous whether I want mutate
to use the row-value or the entire vector. With my data frame, I want to create a variable a guessed moving year for individuals who moved but didn't record the moving date until a later survey instance. Note the data is extremely messy, with some nonsensical moving dates that we want to ignore.
Therefore, I want to create a "guess" value for each row where a person moved but no move_year is recorded. I want the operation to look through the entire vector of moving dates for each individual, subset to include only the ones earlier than the current year, and pick out the one that is the closest to the year for the current row. Granular example: If we look at row #3, the individual moved in that year, but there is no move date. Therefore we want to look at the entire year_moved vector for that person (2010, NA, NA, 2016, 2019, NA)
and choose the one that is the closest to and preferably earlier than the row #3 value of year (2017
). The guess value, therefore, would be 2016
.
Getting the value we want with a given year and vector of values is simple:
year <- 2017
year_moved <- c(2010, 2016, 2017)
year_moved[which.min(year-(year_moved[year_moved<year & !is.na(year_moved)]))]
# [1] 2016
rm(year, year_moved)
However, when I try this within a mutate function, it doesn't give me the same result.
df %>%
group_by(id) %>%
mutate(
year_guess = ifelse(moved==1 & is.na(year_moved),
year_moved[which.min(year-(year_moved[year_moved<year]))],
NA))
# # A tibble: 11 x 6
# # Groups: id [2]
# id year house_apt moved year_moved guess
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 11990 2011 1 1 2010 NA
# 2 11990 2015 1 0 NA NA
# 3 11990 2017 2 1 NA NA
# 4 11990 2018 2 1 2016 NA
# 5 11990 2019 2 1 2019 NA
# 6 11990 2020 1 0 NA NA
# 7 22880 2013 1 1 NA 2011
# 8 22880 2014 1 0 2011 NA
# 9 22880 2015 1 1 NA 2011
# 10 22880 2018 2 0 2014 NA
# 11 22880 2020 2 0 1979 NA
# Warning message:
# In year - (year_moved[year_moved < year & !is.na(year_moved)]) :
# longer object length is not a multiple of shorter object length
(Row 3 should be 2016 and Row 9 should be 2014.) I think part of it is my inability to specify whether I am interested in a row-value or a vector. Note that the first time I refer to "year_moved" (is.na(year_moved)
), I am referring to the value in that row. When I refer to it within the which.min
, I am trying to refer to the groupwise vector. When I refer to "year", I'm trying to refer to the value of the individual row I'm working in. Clearly things are a little muddled, and it's a broader problem I've been running into with many different applications. Can anyone provide guidance?
I've been writing my whole project using tidyverse so would like to continue if possible.
year + 2
is a vectorized operation (with recycling of2
).dplyr
always "understands" a variable as the whole vector (possibly in a group) rather than an observation level single value. – Aurèlerowwise()
beforemutate()
strategy (grouping by rows, so variables withinmutate
refer to vectors... of length 1. Compare @bouncyball's comment and some other possibilities in this answer, for instance: stackoverflow.com/questions/47519201 (map
and variants are similar tos/l/vapply
) – Aurèle