2
votes

I try to use dplyr's mutate() function on a SQL like database. I don't understand why it work sometimes and doesn't other times.

Here is one example with the stantard example from https://cran.rstudio.com/web/packages/dplyr/vignettes/databases.html

loading the data into a database

> library(nycflights13)
> library(dplyr)
> db_sqlite <- src_sqlite("data/my_db.sqlite3", create = TRUE)
> flights_sqlite <- copy_to(db_sqlite, flights, temporary = FALSE, indexes = list(
+   c("year", "month", "day"), "carrier", "tailnum"))

Creating a new numeric variable with success

If I try to create a new numeric variable such as year2, everything works

> tbl(db_sqlite, "flights") %>% mutate(year2 = year + 10)
Source:   query [?? x 20]
Database: sqlite 3.8.6 [data/my_db.sqlite3]

    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>
1   2013     1     1      517            515         2      830            819        11      UA   1545
2   2013     1     1      533            529         4      850            830        20      UA   1714
3   2013     1     1      542            540         2      923            850        33      AA   1141
4   2013     1     1      544            545        -1     1004           1022       -18      B6    725
5   2013     1     1      554            600        -6      812            837       -25      DL    461
6   2013     1     1      554            558        -4      740            728        12      UA   1696
7   2013     1     1      555            600        -5      913            854        19      B6    507
8   2013     1     1      557            600        -3      709            723       -14      EV   5708
9   2013     1     1      557            600        -3      838            846        -8      B6     79
10  2013     1     1      558            600        -2      753            745         8      AA    301
# ... with more rows, and 9 more variables: tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>, year2 <dbl>

But failing to create a new character variable using paste0()

If I try to create a new variable using the base function paste0(), I've got an error.

> tbl(db_sqlite, "flights") %>% mutate(date = paste0(year, month, day))
Source:   query [?? x 20]
Database: sqlite 3.8.6 [data/my_db.sqlite3]

Error in sqliteSendQuery(conn, statement) : 
  error in statement: no such function: PASTE0

Session infos are here :

> devtools::session_info()
Session info ------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.3.2 (2016-10-31)
 system   x86_64, linux-gnu           
 ui       RStudio (1.0.35)            
 language (EN)                        
 collate  C                           
 tz       <NA>                        
 date     2016-12-12                  

Packages ----------------------------------------------------------------------------------------------------
 package      * version    date       source                          
 DBI            0.5-1      2016-09-10 cran (@0.5-1)                   
 R6             2.2.0      2016-10-05 cran (@2.2.0)                   
 RSQLite        1.0.0      2014-10-25 CRAN (R 3.2.2)                  
 Rcpp           0.12.8     2016-11-17 cran (@0.12.8)                  
 assertthat     0.1        2013-12-06 CRAN (R 3.2.2)                  
 devtools       1.12.0     2016-06-24 CRAN (R 3.3.1)                  
 digest         0.6.10     2016-08-02 CRAN (R 3.2.3)                  
 dplyr        * 0.5.0.9000 2016-12-12 Github (hadley/dplyr@c846cb3)   
 lazyeval       0.2.0.9000 2016-10-14 Github (hadley/lazyeval@c155c3d)
 magrittr       1.5        2014-11-22 CRAN (R 3.2.3)                  
 memoise        1.0.0      2016-01-29 CRAN (R 3.2.3)                  
 nycflights13 * 0.2.0      2016-04-30 CRAN (R 3.3.2)                  
 tibble         1.2        2016-08-26 CRAN (R 3.2.3)                  
 withr          1.0.2      2016-06-20 CRAN (R 3.2.3)               

What kind of operations work using mutate() on database ?

Is there any solution to use functions such as paste0() on database without collecting data into memory ?

3
dplyr converts some functions into SQL or, if a function doesn't exist, will pass the function name to SQL. You can instead do something like tbl(db_sqlite, "flights") %>% mutate(date = sql("year||month||day")).Michael Griffiths
dbplyr should be able to handle this, although I needed to use paste0(paste0(var1, var2), var3) for it to work when connected to Oracle.pentandrous

3 Answers

1
votes

Ok, thanks to @michael-griffiths comment the answer is : there is no guarantee that an R function can be translated into sql code.

So, if it returns an error, we may try to use the sql() function to write directly sql code.

Note that this depends on the database (sqlite, postgre, mysql or mariadb, etc).

0
votes

I think this is what the collect() method is for

tbl(db_sqlite, "flights") %>% collect() %>% mutate(date = paste0(year, month, day))

collect() retrieves the data into a local tibble, and then you can apply R functions to it.

I did not execute the code above

0
votes

Yes, that's because paste0 does not have a corresponding translation in SQLite. In dbplyr we can usually use the CONCATENATE SQL command to equate the operation done by paste0. This command should work with different SQL back-ends. Other commands that do not work with SQLite, but will with other databases, is min(), and max() for example.

Regarding evaluation, when you equate the dplyr code to a variable, namely: db_var <- db_table %>% mutate(x = x +1) will not be evaluated until you do something with it, like request a print of the results, which is done by passing db_var or pass a collect() command at the end of it.