
We need to join a database table with the contents of an Excel file. This is straightforward using a dplyr left_join, but requires that copy=TRUE is set in the join because the data do not come from the same source. This in turn means the code will only run when the database user has INSERT privileges, so that left_join can create and populate temporary tables in the database.

Is there any way to do this copy=TRUE left_join without granting INSERT privileges? The database user accessing the data for analysis really should be a reader only.

If INSERT privileges are not granted, an error like:

Error in .local(conn, statement, ...) : 
  could not run statement: INSERT command denied to user 
  'reader'@'192.168.135.1' for table 'utiexriryc'

will appear (with a different random table name on each connection).

A short reproducible example is difficult to provide, as it requires a database connection (in this case to MySQL with src_mysql()) and an Excel file (in this case read with readxl). The join looks like:

df.biozones <- db.sites %>% 
  left_join(ef.join_site_ids, by=c("site_id"="id"), copy=TRUE) %>% 
  collect()

where db.sites is a MySQL database table and ef.join_site_ids is a data frame read from an Excel spreadsheet with readxl.
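For completeness, a minimal sketch of the setup might look like the following. The connection details, table name, and file name are placeholders, and the src_mysql() argument names may differ slightly between dplyr versions:

library(dplyr)
library(readxl)

# Placeholder connection; credentials, host and dbname are illustrative only
my_db <- src_mysql(dbname = "sites_db", host = "192.168.135.2",
                   user = "reader", password = "...")

# Remote (lazy) table and local data frame from the Excel file
db.sites         <- tbl(my_db, "sites")
ef.join_site_ids <- read_excel("site_id_matches.xlsx")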

Is there any way to do the above in dplyr and avoid granting INSERT privileges for the user reading the data?

I don't see how. I'm surprised, however, that you have the ability to create a temporary table but not insert data into it! – hadley

1 Answer


Although there is no way to do what I asked directly, as @hadley said, there is a different way to look at the problem. What took me some time to understand is that dplyr treats each connection as a different source (or src for short). There is one src that is always there: the collection of objects in memory.

So in my question I was actually dealing with three sources:

  1. Objects in memory
  2. Database_1
  3. Database_2

Any join between two of these three sources requires dplyr to create temporary tables and insert into them, and therefore the corresponding permissions. Any join within a single source is not an issue. So there are (at least) three possible ways to deal with this when you cannot get additional permissions on the databases.

1. Load everything into memory

If you read all the objects you need from both databases into memory, you bring them into a shared src and can do any join on them. dplyr offers the collect() function to do so:

db_table_of_interest <- tbl(Database_1, "table_of_interest")
df_table_of_interest <- collect(db_table_of_interest)

This turns the remote table into a local data frame. You will need to do this for all tables of interest, in all databases of interest. It will work fine as long as you have enough RAM to hold all the data you are reading.
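Applied to the join from the question, a sketch might look like this (same object names as above; once both sides are local data frames, copy=TRUE is no longer needed):

# Pull the remote table into memory first
df.sites <- collect(db.sites)

# Both tables now share the in-memory src, so no INSERT privileges are needed
df.biozones <- df.sites %>%
  left_join(ef.join_site_ids, by = c("site_id" = "id"))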

2. Copy identifiers between databases

If (as was our case) you need to match record identifiers (like site or sample IDs) between two databases that store the same sites under different IDs, then the easiest way is to duplicate the IDs in a table in both databases. So instead of using an external sheet where you list the IDs of each site in each database and set up the matches, add one column to the site table in each database with a reference to the other.

As long as you do not need to set up joined tables with the actual data, you will be fine: for example, when you read a sites-by-species table and a sites-by-environment table where the species and environment data live in separate databases that use different IDs for each site.

If you do need to join tables, then at least you will be able to pre-select in both databases before doing a collect() and reading everything into memory, as sketched below. That reduces the memory requirements when there is a lot of data to read.
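A sketch of that pattern, assuming Database_1 has hypothetical tables sites and sites_by_species, Database_2 a hypothetical table sites_by_environment, and that sites has gained a hypothetical db2_site_id column referencing the matching ID in Database_2:

# This join stays within Database_1, so no cross-src copying is required
df.sites <- tbl(Database_1, "sites") %>%
  left_join(tbl(Database_1, "sites_by_species"), by = "site_id") %>%
  collect()

# Use the copied identifiers to pre-select in Database_2 before collecting
ids <- unique(df.sites$db2_site_id)
df.env <- tbl(Database_2, "sites_by_environment") %>%
  filter(site_id %in% ids) %>%
  collect()

# The final join is between two local data frames, in memory
df.all <- left_join(df.sites, df.env, by = c("db2_site_id" = "site_id"))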

3. Copy everything into one analysis database

If for some reason getting more memory is not an option, then in theory you could copy all the data into a single database first. I cannot imagine this being viable in any situation other than one where there are only a few tables of interest, so that merging them into a new joint database is feasible; perhaps in cases where the data set is very large and does not change.
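If you do go that route, the copying itself can be done with dplyr's copy_to(). A sketch, assuming a hypothetical analysis database connection (analysis_db) where the user does have write access:

# Read a table from a source database into memory...
df_sites <- collect(tbl(Database_1, "sites"))

# ...and write it as a permanent (non-temporary) table in the analysis
# database; repeat for every table of interest from both databases
copy_to(analysis_db, df_sites, name = "sites", temporary = FALSE)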

If there is an option I have missed in this list, I would be happy to know what other solutions you may have found.