2
votes

I'm having some trouble with advanced operations in dplyr with grouped data. I'm not sure how to specify if I want to refer to an observation-level value, and when I can specifically refer to the entire vector.

Sample data frame:

df <- as.data.frame(
  rbind(
    c(11990, 2011, 1, 1, 2010),
    c(11990, 2015, 1, 0, NA),
    c(11990, 2017, 2, 1, NA),
    c(11990, 2018, 2, 1, 2016),
    c(11990, 2019, 2, 1, 2019),
    c(11990, 2020, 1, 0, NA),
    c(22880, 2013, 1, 1, NA),
    c(22880, 2014, 1, 0, 2011),
    c(22880, 2015, 1, 1, NA),
    c(22880, 2018, 2, 0, 2014),
    c(22880, 2020, 2, 0, 1979)))
names(df) <- c("id", "year", "house_apt", "moved", "year_moved")

# > df
#       id year house_apt moved year_moved
# 1  11990 2011         1     1       2010
# 2  11990 2015         1     0         NA
# 3  11990 2017         2     1         NA
# 4  11990 2018         2     1       2016
# 5  11990 2019         2     1       2019
# 6  11990 2020         1     0         NA
# 7  22880 2013         1     1         NA
# 8  22880 2014         1     0       2011
# 9  22880 2015         1     1         NA
# 10 22880 2018         2     0       2014
# 11 22880 2020         2     0       1979

If I do simple mutate operations:

library(dplyr)
df %>% mutate(year+2)
df %>% group_by(id) %>% mutate(year+2)

It's pretty obvious that "year" here refers to each individual row value. This is the case even if I were to (for some reason) do it with a grouping. However, if I were to do the following two operations which involve a vector operation:

df %>% mutate(sum(year))
df %>% group_by(id) %>% mutate(sum(year))

dplyr understands "year" as the entire vector of year values for that whole group.

However, now I am having a lot of trouble with an operation where it is ambiguous whether I want mutate to use the row-value or the entire vector. With my data frame, I want to create a variable a guessed moving year for individuals who moved but didn't record the moving date until a later survey instance. Note the data is extremely messy, with some nonsensical moving dates that we want to ignore.

Therefore, I want to create a "guess" value for each row where a person moved but no move_year is recorded. I want the operation to look through the entire vector of moving dates for each individual, subset to include only the ones earlier than the current year, and pick out the one that is the closest to the year for the current row. Granular example: If we look at row #3, the individual moved in that year, but there is no move date. Therefore we want to look at the entire year_moved vector for that person (2010, NA, NA, 2016, 2019, NA) and choose the one that is the closest to and preferably earlier than the row #3 value of year (2017). The guess value, therefore, would be 2016.

Getting the value we want with a given year and vector of values is simple:

year <- 2017
year_moved <- c(2010, 2016, 2017)
year_moved[which.min(year-(year_moved[year_moved<year & !is.na(year_moved)]))]
# [1] 2016
rm(year, year_moved)

However, when I try this within a mutate function, it doesn't give me the same result.

df %>% 
  group_by(id) %>%
  mutate(
    year_guess = ifelse(moved==1 & is.na(year_moved),
                        year_moved[which.min(year-(year_moved[year_moved<year]))],
                        NA))
# # A tibble: 11 x 6
# # Groups:   id [2]
#       id  year house_apt moved year_moved guess
#    <dbl> <dbl>     <dbl> <dbl>      <dbl> <dbl>
#  1 11990  2011         1     1       2010    NA
#  2 11990  2015         1     0         NA    NA
#  3 11990  2017         2     1         NA    NA
#  4 11990  2018         2     1       2016    NA
#  5 11990  2019         2     1       2019    NA
#  6 11990  2020         1     0         NA    NA
#  7 22880  2013         1     1         NA  2011
#  8 22880  2014         1     0       2011    NA
#  9 22880  2015         1     1         NA  2011
# 10 22880  2018         2     0       2014    NA
# 11 22880  2020         2     0       1979    NA
# Warning message:
# In year - (year_moved[year_moved < year & !is.na(year_moved)]) :
#   longer object length is not a multiple of shorter object length

(Row 3 should be 2016 and Row 9 should be 2014.) I think part of it is my inability to specify whether I am interested in a row-value or a vector. Note that the first time I refer to "year_moved" (is.na(year_moved)), I am referring to the value in that row. When I refer to it within the which.min, I am trying to refer to the groupwise vector. When I refer to "year", I'm trying to refer to the value of the individual row I'm working in. Clearly things are a little muddled, and it's a broader problem I've been running into with many different applications. Can anyone provide guidance?

I've been writing my whole project using tidyverse so would like to continue if possible.

1
Re "It's pretty obvious that "year" here refers to each individual row value."... Not that obvious. year + 2 is a vectorized operation (with recycling of 2). dplyr always "understands" a variable as the whole vector (possibly in a group) rather than an observation level single value.Aurèle
Oh interesting. So the whole premise of my question is wrong. You are basically never referring to a row value within mutate, you're always calling the entire vector. Now that you point it out it is very intuitive.MrMr
So is using an apply function as @Marius points out below is the only way to do an operation that involves just the value for that row rather than the whole vector? Or is there some other more generalized way to do this. I'm trying to understand more of the intuition behind vectorized / nonvectorized functions so any help would be great.MrMr
Not the only way. There is also the rowwise() before mutate() strategy (grouping by rows, so variables within mutate refer to vectors... of length 1. Compare @bouncyball's comment and some other possibilities in this answer, for instance: stackoverflow.com/questions/47519201 (map and variants are similar to s/l/vapply)Aurèle

1 Answers

1
votes

I think the most straightforward way to modify your current attempt to get the right results is to wrap the guessing operation in sapply so that a guess is separately calculated for each year:

df %>% 
    group_by(id) %>%
    mutate(
        year_guess = ifelse(
            moved==1 & is.na(year_moved),
            sapply(year, function(x) year_moved[which.min(x-(year_moved[year_moved < x]))]),
            NA)
        )

I haven't been able to fully unpack the logic of how this works but I think as written your guessing procedure is a little bit complex to be easily vectorized (although it probably can be if you approach it in a slightly different way).

Output:

# A tibble: 11 x 6
# Groups:   id [2]
      id  year house_apt moved year_moved year_guess
   <dbl> <dbl>     <dbl> <dbl>      <dbl>      <dbl>
 1 11990  2011         1     1       2010         NA
 2 11990  2015         1     0         NA         NA
 3 11990  2017         2     1         NA       2016
 4 11990  2018         2     1       2016         NA
 5 11990  2019         2     1       2019         NA
 6 11990  2020         1     0         NA         NA
 7 22880  2013         1     1         NA       2011
 8 22880  2014         1     0       2011         NA
 9 22880  2015         1     1         NA       2014
10 22880  2018         2     0       2014         NA
11 22880  2020         2     0       1979         NA