How to use data.table::setDTthreads() in my own package?

Question

I'm developing a very small package for the first time (and, perhaps importantly in the context of my question, I would like to publish it on CRAN). This package uses functions from data.table and base R. I would like to take advantage of the parallel computation controlled by the data.table::setDTthreads() function.

When a user loads the data.table package, this function is called immediately, but I'm not doing this when developing my package. What I have done so far is: (1) added data.table to the Imports field of the DESCRIPTION file; (2) included import(data.table) in the NAMESPACE.

As far as I know, this is not the same as library(data.table), and I don't want that, because I don't want to load data.table when the user loads my package. But I would still like to use the data.table::setDTthreads() function. Where should I put the call in my script? Or am I perhaps already using it, since I have included import(data.table) in the NAMESPACE?

My package contains only one .R file in the R/ directory, with a few functions, but only one will be exported and visible to the user (the others are just helper functions). Let's say it looks like this:

#' roxygen2 skeleton
main_function <- function(x) {
 x <- helper_function_1(x)
 x
}

helper_function_1 <- function(x) {
 x
}

What I really worry about is that using data.table::setDTthreads() in my package will affect the user's environment, i.e. that I will enable parallel computation and set the number of threads not just for the function in my package, but for the user's whole session.

jangorecki · Accepted Answer · 2021-06-26T18:52:19

Very good question. Yes, it will affect all data.table calls (including those from other packages) in the user's session, not just those from your package. The general advice is not to set this value in your package, but to let users know that they can set it themselves. If you do set it in your package, you should document it really well. Note that the difference between using 50% and 100% of threads is often very small (it can be less than 5%, or even a slowdown in shared environments), so I suggest you measure whether it is really worth interfering with the user's environment when the benefits are small. See, for example, the timings at https://github.com/h2oai/db-benchmark/issues/202
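If you do decide to change the thread count inside your package, one way to limit the side effect is to restore the previous value when your function exits. The sketch below is only an illustration of that idea, not a recommendation from the data.table authors: the `threads` argument is a hypothetical addition to the question's `main_function()`, and the code relies on `setDTthreads()` invisibly returning the previous setting. Other data.table calls that run while your function is executing still see the changed value.

```r
# Hypothetical variant of the question's exported function.
# `threads` is an illustrative argument; NULL leaves the user's setting alone.
main_function <- function(x, threads = NULL) {
  if (!is.null(threads)) {
    # setDTthreads() invisibly returns the previous number of threads
    old_threads <- data.table::setDTthreads(threads)
    # Restore the user's setting when this function exits, even on error
    on.exit(data.table::setDTthreads(old_threads), add = TRUE)
  }
  helper_function_1(x)
}

# Internal helper, as in the question's skeleton
helper_function_1 <- function(x) {
  x
}

# Usage: the session-wide setting should be unchanged after the call
threads_before <- data.table::getDTthreads()
result <- main_function(1:3, threads = 1L)
threads_after <- data.table::getDTthreads()
```

Making the argument opt-in keeps the default behaviour under the user's control, which matches the advice above to document the setting rather than impose it.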

You could also file a feature request for the ability to set the number of threads just for calls from a single package. It is technically possible by checking the top environment of a call.
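To make "checking the top environment of a call" concrete, here is a minimal base R sketch of how a function can find out which package (or the global environment) it was called from. All function names are hypothetical and this is not how data.table implements anything; it only shows the mechanism such a feature could build on.

```r
# Hypothetical helper: name of the top-level environment (package namespace
# or global environment) of whoever called the function that calls this helper.
calling_package <- function() {
  # parent.frame(1) is the frame of the function that called this helper;
  # parent.frame(2) is the frame of that function's own caller.
  environmentName(topenv(parent.frame(2L)))
}

# Stand-in for a library function that could pick a per-package thread count
thread_aware_fun <- function() {
  calling_package()
}

# Stand-in for user code defined in a script or at the console
user_fun <- function() {
  thread_aware_fun()
}

caller <- user_fun()
```

When `user_fun()` is defined in a script or at the console, `caller` is `"R_GlobalEnv"`; if the calling function lived in a package, the same lookup would give that package's name.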

