1
votes

dbplyr translates dplyr and base R commands into SQL so that developers can write R code and have it executed in a database (tidyverse reference). When working with dates in R, it is common to use the lubridate package. However, dbplyr translations for lubridate functions do not exist at present. So developers using dbplyr need to find alternative ways of working with dates.

My previous approach has been to use fragments of SQL syntax in my dplyr commands (see for example answers: here and here). However, this requires the developer to know (or find out) the corresponding SQL commands, when part of the point of dbplyr is that it translates to SQL for you.

This makes me ask: What is the best way to manipulate dates using dbplyr translation only when connecting to a remote database?

The ideal solution will:

  • Use only dbplyr translation, so functions without a dbplyr translation cannot be used.
  • Use pure R, no SQL fragments.
  • Run on the database, so a remote table instead of a local table.

I think at a minimum we should be able to:

  • extract year, month, and day
  • combine year, month, and day into a new date

From these you can manually do other operations such as:

  • Increment a date
  • Find the difference between two dates
  • Find the date of the end of the month

But faster/elegant ways to do these more advance operations would be preferable.

2
On PostgreSQL, at least using RPostgres, you can use lubridate functions quite readily, as illustrated here. I think one can do all the things you ask for in this way. If this doesn't work with other SQL implementations (e.g., MySQL), this may be require work to the associated packages (e.g., in RMySQL).Ian Gow

2 Answers

2
votes

One answer is that much of this is already possible. (See answer here.)

To the extent that desired functions are missing from dbplyr, one idea would be to write pull requests that add more translations of lubridate functions to the backends in dbplyr.

It seems that translations are inevitably backend-specific. If you look at the PostgreSQL backend here, you can see that some lubridate functions (e.g., month or quarter) are given translations there, but others (e.g., ymd) are not.

0
votes

The first approach that comes to mind is converting dates to text, because there are already dbplyr translations for different forms of text manipulation. This approach depends on as.character to convert dates to character, and substr to extract the year, month, or day as text. This can then be converted to numeric and manipulated further.

(1) Set up simulated database connection to test translation (pick your preferred SQL flavor):

library(dplyr)
library(dbplyr)

df = data.frame(start_dates = c('2020-01-31', '2020-02-28', '2020-03-31'))

# simulate a connection to test translation (pick your preferred flavor)
df = tbl_lazy(df, con = simulate_mssql())
# df = tbl_lazy(df, con = simulate_hive())
# df = tbl_lazy(df, con = simulate_impala())
# df = tbl_lazy(df, con = simulate_oracle())
# df = tbl_lazy(df, con = simulate_postgres())
# df = tbl_lazy(df, con = simulate_mysql())
# df = tbl_lazy(df, con = simulate_sqlite())

(2) Example - extract date components, increment year, and recombine:

output = df %>%
  mutate(text_date = as.character(start_dates)) %>%
  mutate(text_year = substr(text_date, 1, 4),
         text_month = substr(text_date, 6, 7),
         text_day = substr(text_date, 9, 10)) %>%
  mutate(num_year = as.numeric(text_year),
         num_month = as.numeric(text_month),
         num_day = as.numeric(text_day)) %>%
  select(start_dates, num_year, num_month, num_day) %>%
  mutate(next_year = num_year + 1) %>%
  mutate(next_year_text_date = paste0(next_year, '-', num_month, '-', num_day)) %>%
  mutate(next_year_date = as.Date(next_year_text_date)) %>%
  select(start_dates, next_year_date)

Calling show_query(output) then gives the following translation, but not as nicely formatted. I know the nested queries are not considered good SQL practice, but this is how dbplyr translation works.

SELECT `start_dates`
    , TRY_CAST(`next_year_text_date` AS DATE) AS `next_year_date`
FROM (
    SELECT `start_dates`
        , `num_year`
        , `num_month`
        , `num_day`
        , `next_year`
        , `next_year` + '-' + `num_month` + '-' + `num_day` AS `next_year_text_date`
    FROM (
        SELECT `start_dates`
            , `num_year`
            , `num_month`
            , `num_day`
            , `num_year` + 1.0 AS `next_year`
        FROM (
            SELECT `start_dates`
                , TRY_CAST(`text_year` AS FLOAT) AS `num_year`
                , TRY_CAST(`text_month` AS FLOAT) AS `num_month`
                , TRY_CAST(`text_day` AS FLOAT) AS `num_day`
            FROM (
                SELECT `start_dates`
                    , `text_date`
                    , SUBSTRING(`text_date`, 1, 4) AS `text_year`
                    , SUBSTRING(`text_date`, 6, 2) AS `text_month`
                    , SUBSTRING(`text_date`, 9, 2) AS `text_day`
                FROM (
                    SELECT `start_dates`
                        , TRY_CAST(`start_dates` AS VARCHAR(MAX)) AS `text_date`
                    FROM `df`
                ) `q01`
            ) `q02`
        ) `q03`
    ) `q04`
) `q05`

(3) Extract components, compact:

output = df %>%
  mutate(num_year = as.numeric(substr(as.character(start_dates), 1, 4)),
         num_month = as.numeric(substr(as.character(start_dates), 6, 7)),
         num_day = as.numeric(substr(as.character(start_dates), 9, 10)))

With SQL translation from show_query(output) much shorter:

SELECT `start_dates`
    , TRY_CAST(SUBSTRING(TRY_CAST(`start_dates` AS VARCHAR(MAX)), 1, 4) AS FLOAT) AS `num_year`
    , TRY_CAST(SUBSTRING(TRY_CAST(`start_dates` AS VARCHAR(MAX)), 6, 2) AS FLOAT) AS `num_month`
    , TRY_CAST(SUBSTRING(TRY_CAST(`start_dates` AS VARCHAR(MAX)), 9, 2) AS FLOAT) AS `num_day`
FROM `df`

Hopefully this works on all the flavors of SQL that dbplyr can translate to. As I do not have access to every SQL flavor to test it, comments from people who have tested it on a specific SQL flavor would be helpful.