
This question arose while working on another question, "The R dplyr function arrange(ymd(col)) is not working".

We have this data frame:

df <- structure(list(record_id = 1:5, group = c("A", "B", "C", "D", 
"E"), date_start = c("Apr-22", "Aug-21", "Jan-22", "Feb-22", 
"Dec-21")), class = "data.frame", row.names = c(NA, -5L))

  record_id group date_start
1         1     A     Apr-22
2         2     B     Aug-21
3         3     C     Jan-22
4         4     D     Feb-22
5         5     E     Dec-21

We would like to sort by date_start.

My first approach worked:

library(dplyr)
library(lubridate)
df %>%
  mutate(date_start1 = myd(paste0(date_start,"-01"))) %>% 
  arrange(date_start1) %>% 
  select(-date_start1)

  record_id group date_start
1         2     B     Aug-21
2         5     E     Dec-21
3         3     C     Jan-22
4         4     D     Feb-22
5         1     A     Apr-22

Then I tried this, and it also worked:

library(dplyr)
library(lubridate)
df %>% 
  arrange(date_start1 = myd(paste0(date_start,"-01")))

  record_id group date_start
1         2     B     Aug-21
2         5     E     Dec-21
3         3     C     Jan-22
4         4     D     Feb-22
5         1     A     Apr-22

I would like to understand how a single arrange() call can do the same as a combination of mutate(), arrange() and select().

If you look at the last one, arrange doesn't create a new column date_start1 in the dataset, i.e. ... - <data-masking> Variables, or functions of variables. - akrun
That's a cool trick! It seems like arrange is invisibly creating a temporary date_start1 to sort off of and then removing it. Can't find that documented anywhere. - Dan Adams
arrange.data.frame calls dplyr:::arrange_rows and if you check it is doing a loop with map2 (transmute is also used) - akrun
I did not know this but I think it makes more sense if you think about it without the assignment in the arrange statement, i.e. arrange(myd(paste0(date_start,"-01"))). I wouldn't use it though - fewer keystrokes but makes the code less clear. - SamR

1 Answer


What the code is not doing

The output of arrange() is perhaps surprising because you might expect it to do the following.

Everything to the right of the = is just an expression that creates a vector:

time_col <- df$date_start %>% 
  paste0(.,"-01") %>%
  myd() %>%
  print()
#> [1] "2022-04-01" "2021-08-01" "2022-01-01" "2022-02-01" "2021-12-01"

The = would then be assignment to a new column:

df_with_date <- df %>%
  mutate(date_start1 = time_col) %>%
  print()
#>   record_id group date_start date_start1
#> 1         1     A     Apr-22  2022-04-01
#> 2         2     B     Aug-21  2021-08-01
#> 3         3     C     Jan-22  2022-01-01
#> 4         4     D     Feb-22  2022-02-01
#> 5         5     E     Dec-21  2021-12-01

You would then be sorting on that variable:

df_with_date %>% arrange(date_start1)
#>   record_id group date_start date_start1
#> 1         2     B     Aug-21  2021-08-01
#> 2         5     E     Dec-21  2021-12-01
#> 3         3     C     Jan-22  2022-01-01
#> 4         4     D     Feb-22  2022-02-01
#> 5         1     A     Apr-22  2022-04-01

What the code is doing

If you look at the output, the code is not doing what is shown above and then removing a column. The result does not contain the new column date_start1, and we never had to remove it manually:

df %>%
  arrange(date_start1 = myd(paste0(date_start,"-01")))
#>   record_id group date_start
#> 1         2     B     Aug-21
#> 2         5     E     Dec-21
#> 3         3     C     Jan-22
#> 4         4     D     Feb-22
#> 5         1     A     Apr-22

The key then is to understand that you are not creating a new variable that is added to the data.frame, sorting on it, then removing it. Rather, you are passing a set of values (one per row) on which to sort.
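
A quick way to check this is to compare the three spellings directly. The sketch below rebuilds df from the question; the object names via_mutate, named and unnamed are made up for illustration. If the name on the left of the = mattered, the named and unnamed calls would differ, or date_start1 would show up in the result.

```r
library(dplyr)
library(lubridate)

# Same data as in the question
df <- data.frame(
  record_id  = 1:5,
  group      = c("A", "B", "C", "D", "E"),
  date_start = c("Apr-22", "Aug-21", "Jan-22", "Feb-22", "Dec-21")
)

# Explicit version: add a helper column, sort on it, drop it again
via_mutate <- df %>%
  mutate(date_start1 = myd(paste0(date_start, "-01"))) %>%
  arrange(date_start1) %>%
  select(-date_start1)

# Named expression inside arrange()
named <- df %>% arrange(date_start1 = myd(paste0(date_start, "-01")))

# Same expression without a name
unnamed <- df %>% arrange(myd(paste0(date_start, "-01")))

# Compare the results and check whether a helper column was kept
all.equal(via_mutate, named)
all.equal(named, unnamed)
"date_start1" %in% names(named)
```

If the explanation above is right, both comparisons should come back TRUE and the last line FALSE: the name is not used for anything in the result.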

Why this is possible

This is permitted because you can pass any vector with one value per row, whether or not it is a function of the variables in the data. As noted in the documentation for arrange(), the second argument (...) is:

Variables, or functions of variables. Use desc() to sort a variable in descending order.

All you are doing is passing a function of variables! This is why you can also do:

df %>%
  arrange(1:nrow(df) + record_id)
#>   record_id group date_start
#> 1         1     A     Apr-22
#> 2         2     B     Aug-21
#> 3         3     C     Jan-22
#> 4         4     D     Feb-22
#> 5         5     E     Dec-21
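
The same idea can be written in base R, which may make it clearer that only a vector of sort keys is involved. This is a sketch for comparison, not a description of how arrange() is implemented internally; sort_key and the result names are invented here. The last call applies desc(), mentioned in the documentation quoted above, to the same expression.

```r
library(dplyr)
library(lubridate)

# Same data as in the question
df <- data.frame(
  record_id  = 1:5,
  group      = c("A", "B", "C", "D", "E"),
  date_start = c("Apr-22", "Aug-21", "Jan-22", "Feb-22", "Dec-21")
)

# Compute the sort keys as a free-standing vector; df itself is untouched
sort_key <- myd(paste0(df$date_start, "-01"))

# Base R: order() turns the keys into row positions, which then index df
base_sorted <- df[order(sort_key), ]

# dplyr: pass the same kind of expression straight to arrange()
dplyr_sorted <- df %>% arrange(myd(paste0(date_start, "-01")))

# Wrapping the expression in desc() sorts newest first
newest_first <- df %>% arrange(desc(myd(paste0(date_start, "-01"))))

base_sorted$record_id
dplyr_sorted$record_id
newest_first$record_id
```

One visible difference: the base R version keeps the original row names, whereas arrange() renumbers the rows, as in the outputs shown in the question.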