3
votes

I've got data from a number of surveys. Each survey can be sent multiple times with updated values. For each survey/row in the dataset there's a date when the survey was submited (created). I'd like to merge the rows for each survey and keep the date from the first survey but other data from the last survey.

A simple example:

#>   survey    created var1 var2
#> 1     s1 2020-01-01   10   30
#> 2     s2 2020-01-02   10   90
#> 3     s2 2020-01-03   20   20
#> 4     s3 2020-01-01   45    5
#> 5     s3 2020-01-02   50   50
#> 6     s3 2020-01-03   30   10

Desired result:

#>   survey    created var1 var2
#> 1     s1 2020-01-01   10   30
#> 2     s2 2020-01-02   20   20
#> 3     s3 2020-01-01   30   10

Example data:

df <- data.frame(survey = c("s1", "s2", "s2", "s3", "s3", "s3"),
                created = as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01", "2020-01-02", "2020-01-03"), "%Y-%m-%d", tz = "GMT"),
                var1 = c(10, 10, 20, 45, 50, 30),
                var2 = c(30, 90, 20, 5, 50, 10),
                stringsAsFactors=FALSE)

I've tried group_by with summarize in different ways but can't make it work, any help would be highly appreciated!

2

2 Answers

3
votes

After grouping by 'survey', change the 'created' as the first or min value in 'created' and then slice the last row (n())

library(dplyr)
df %>% 
   group_by(survey) %>% 
   mutate(created = as.Date(first(created))) %>% 
   slice(n())
# A tibble: 3 x 4
# Groups:   survey [3]
#  survey created     var1  var2
#  <chr>  <date>     <dbl> <dbl>
#1 s1     2020-01-01    10    30
#2 s2     2020-01-02    20    20
#3 s3     2020-01-01    30    10

Or using base R

transform(df, created = ave(created, survey, FUN = first)
         )[!duplicated(df$survey, fromLast = TRUE),]
2
votes

After selecting the first created date we can select the last values from all the columns.

library(dplyr)

df %>%
  group_by(survey) %>%
  mutate(created = as.Date(first(created))) %>%
  summarise(across(created:var2, last))
  #In older version use `summarise_at`
  #summarise_at(vars(created:var2), last)


# A tibble: 3 x 4
#  survey created     var1  var2
#  <chr>  <date>     <dbl> <dbl>
#1 s1     2020-01-01    10    30
#2 s2     2020-01-02    20    20
#3 s3     2020-01-01    30    10