0
votes

I'm working with a data.table in R with ~6e6 rows and I created a function that I pass through data.table to create a new column based on two grouping values. Technically, my function loops through each row of the grouped arguments and does some very simple algebraic operations, but given the size of my data.table, this will take quite some time.

I'm familiar with the foreach() function and other functions that use multiple cores for computing, but I haven't read or come across a way to use parallelization to speed up a for-loop that's specified within a function passed through data.table. Essentially, I want each for-loop iteration to be handled by multiple cores as opposed to one. Has anyone had experience with this and/or implemented it within data.table while using a user-specified function containing a for-loop?

2
Please provide a reproducible example of the for-loop you are trying to parallelize. – F. Privé

2 Answers

2
votes

I think the best answer for you may be that Matt Dowle and others are making a concerted effort to internalize parallelization in data.table. To be honest, I can't quite follow all of the discussion, but I have learned from experience that grouping is now parallelised in data.table and that the command:

setDTthreads(0)

more often than not gives me a speedup. Some links here:

Is grouping parallelised in data.table 1.12.0?
https://www.rdocumentation.org/packages/data.table/versions/1.12.2/topics/setDTthreads
https://github.com/Rdatatable/data.table/issues/2031
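For reference, here's a minimal sketch of how I enable it (the table and column names are invented, and whether a given grouped call benefits depends on the operation and your data.table version):

library(data.table)

# report how many threads data.table is currently using
getDTthreads(verbose = TRUE)

# 0 = use all available logical CPUs
setDTthreads(0)

# a grouped aggregation like this one relies on data.table's
# internal parallelism, no cluster setup required
DT <- data.table(g1 = sample(1e3, 6e6, TRUE),
                 g2 = sample(letters, 6e6, TRUE),
                 x  = rnorm(6e6))
DT[, .(m = mean(x)), by = .(g1, g2)]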

I also want to point out that dcast handles grouping very quickly. I often use it like this:

# Some grouped Data:
z1[,1:4]
      CountyCode Gender  LYV Active
   1:         GY      M 2019   5742
   2:         KI      M 2019 244077
   3:         KI      F 2019 266944
   4:         CR      M 2018  51993
   5:         GY      M 2008    150
  ---                              
2172:         WT      U 2017      1
2173:         WK      M 2005      1
2174:         YA      U 1900     28
2175:         WL      U 1900      5
2176:         WK      U 1900      2

# Selecting group sums by gender for particular years:
z1[LYV %in% c("1900","2008","2012","2016","2017","2018","2019"),
.(Gender,LYV,Active)][,dcast(.SD,Gender ~ LYV,value.var="Active",fun.aggregate=sum)]

   Gender   1900  2008  2012   2016  2017   2018   2019
1:      F 275845 15694 43851 191024 27996 927968 777369
2:      M 307010 14543 41069 165942 24837 849066 688101
3:      U   6183    22    94   1161   233   5589   4804

Recent improvements in 'dcast' have given it a lot of flexibility. Using brackets to chain it onto the subsetted data.table ('.SD') is frowned upon in some documents I have read, but it works just fine for me. You might be able to simply use a 'set' command and 'dcast' to achieve your optimization goals without manual parallelization, roughly along the lines of the sketch below.
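Here is what I have in mind, as a rough sketch only (the table and columns are made up, and I'm reading 'set' loosely as the setkey/set family of in-place commands):

library(data.table)

# stand-in for grouped data: two grouping columns and a value
DT2 <- data.table(grp1 = sample(LETTERS[1:4], 1e5, TRUE),
                  grp2 = sample(2015:2019, 1e5, TRUE),
                  val  = rnorm(1e5))

# order by the grouping columns in place
setkey(DT2, grp1, grp2)

# collapse the groups in one dcast call, summing val per combination
dcast(DT2, grp1 ~ grp2, value.var = "val", fun.aggregate = sum)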

0
votes

Since you don't provide sample data, here's a simple example that might get you started.

library(data.table)
library(doParallel)

dt <- data.table(a = sample(1:3, 1e6, TRUE),
                 b = sample(letters[1:5], 1e6, TRUE),
                 x = rnorm(1e6))

# spin up one worker per core and register them with foreach
workers <- makeCluster(detectCores())
registerDoParallel(workers)

# row indices of each (a, b) group, wrapped in a list column
ids <- dt[, .(list(.I)), by = .(a, b)]

# compute each group's result on a worker and assign the combined
# vector back into the rows given by the unlisted indices
dt[unlist(ids$V1), y := foreach(i = ids$V1, .combine = c, .export = "dt", .packages = "data.table") %dopar% {
  setDT(dt)[i, as.numeric(scale(x))]
}]

stopCluster(workers); registerDoSEQ(); rm(workers)

# sanity check
dt[, identical(y, as.numeric(scale(x))), by = .(a, b)]
    a b   V1
 1: 2 c TRUE
 2: 1 a TRUE
 3: 3 d TRUE
 4: 1 d TRUE
 5: 1 b TRUE
 6: 3 c TRUE
 7: 2 e TRUE
 8: 3 e TRUE
 9: 2 a TRUE
10: 2 b TRUE
11: 2 d TRUE
12: 1 c TRUE
13: 3 a TRUE
14: 3 b TRUE
15: 1 e TRUE

We first get the row indices for each group and save them in ids (in a list so that they can be directly passed to foreach). The line that assigns y passes the unlisted indices to data.table's i so that the result of foreach is assigned to the appropriate rows.

We use setDT inside the foreach code because the table is serialized to the workers, so the address in memory changes (at least I think so, maybe someone else can confirm).

Definitely benchmark it with your actual function; using foreach is no guarantee of a speedup. Given the serialization, copying the data to the workers might be too much overhead, relatively speaking.
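For instance, a rough comparison against a plain grouped := could look like the following (run it while the workers from above are still registered, i.e. before stopCluster, and put your real function in place of scale):

# serial baseline: same computation using data.table's own grouping
system.time(dt[, y_serial := as.numeric(scale(x)), by = .(a, b)])

# parallel version from above
system.time(
  dt[unlist(ids$V1), y := foreach(i = ids$V1, .combine = c,
                                  .export = "dt", .packages = "data.table") %dopar% {
    setDT(dt)[i, as.numeric(scale(x))]
  }]
)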