0
votes

I'm working with a data.table in R with ~6e6 rows and I created a function that I pass through data.table to create a new column based on two grouping values. Technically, my function loops through each row of the grouped arguments and does some very simple algebraic operations, but given the size of my data.table, this will take quite some time.

I'm familiar with the foreach() function and other functions that use multiple cores for computing, but I haven't read or come across a way to use parallelization to speed up a for-loop that's specified within a function passed through data.table. Essentially, I want each for-loop iteration to be handled by multiple cores as opposed to one. Has anyone had experience with this and/or implemented it within data.table while using a user-specified function containing a for-loop?

2
Please provide a reproducible example of the for-loop you are trying to parallelize. – F. Privé

2 Answers

2
votes

I think the best answer for you may be that Matt Dowle and others are making a concerted effort to internalize parallelization in data.table. To be honest, I can't quite follow all of the discussion, but I have learned from experience that grouping is now parallelised in data.table and that the command:

setDTthreads(0)

more often than not gives me a speedup. Some links here:

Is grouping parallelised in data.table 1.12.0?
https://www.rdocumentation.org/packages/data.table/versions/1.12.2/topics/setDTthreads
https://github.com/Rdatatable/data.table/issues/2031
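For reference, here's a minimal sketch of how I enable it (the table and column names are invented, and whether a given grouped call benefits depends on the operation and your data.table version):

library(data.table)

# report how many threads data.table is currently using
getDTthreads(verbose = TRUE)

# 0 = use all available logical CPUs
setDTthreads(0)

# a grouped aggregation like this one relies on data.table's
# internal parallelism, no cluster setup required
DT <- data.table(g1 = sample(1e3, 6e6, TRUE),
                 g2 = sample(letters, 6e6, TRUE),
                 x  = rnorm(6e6))
DT[, .(m = mean(x)), by = .(g1, g2)]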

I also want to point out that dcast handles grouping very quickly. I often use it like this:

# Some grouped Data:
z1[,1:4]
      CountyCode Gender  LYV Active
   1:         GY      M 2019   5742
   2:         KI      M 2019 244077
   3:         KI      F 2019 266944
   4:         CR      M 2018  51993
   5:         GY      M 2008    150
  ---                              
2172:         WT      U 2017      1
2173:         WK      M 2005      1
2174:         YA      U 1900     28
2175:         WL      U 1900      5
2176:         WK      U 1900      2

# Selecting group sums by gender for particular years:
z1[LYV %in% c("1900","2008","2012","2016","2017","2018","2019"),
.(Gender,LYV,Active)][,dcast(.SD,Gender ~ LYV,value.var="Active",fun.aggregate=sum)]

   Gender   1900  2008  2012   2016  2017   2018   2019
1:      F 275845 15694 43851 191024 27996 927968 777369
2:      M 307010 14543 41069 165942 24837 849066 688101
3:      U   6183    22    94   1161   233   5589   4804

Recent improvements in 'dcast' have given it a lot of flexibility. Using brackets to chain it onto the subsetted data.table ('.SD') is frowned upon in some documents I have read, but it works just fine for me. You might be able to simply use a 'set' command and 'dcast' to achieve your optimization goals without manual parallelization, roughly along the lines of the sketch below.
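Here is what I have in mind, as a rough sketch only (the table and columns are made up, and I'm reading 'set' loosely as the setkey/set family of in-place commands):

library(data.table)

# stand-in for grouped data: two grouping columns and a value
DT2 <- data.table(grp1 = sample(LETTERS[1:4], 1e5, TRUE),
                  grp2 = sample(2015:2019, 1e5, TRUE),
                  val  = rnorm(1e5))

# order by the grouping columns in place
setkey(DT2, grp1, grp2)

# collapse the groups in one dcast call, summing val per combination
dcast(DT2, grp1 ~ grp2, value.var = "val", fun.aggregate = sum)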

0
votes

Since you don't provide sample data, here's a simple example that might get you started.

library(data.table)
library(doParallel)

dt <- data.table(a = sample(1:3, 1e6, TRUE),
                 b = sample(letters[1:5], 1e6, TRUE),
                 x = rnorm(1e6))

# spin up one worker per core and register them with foreach
workers <- makeCluster(detectCores())
registerDoParallel(workers)

# row indices of each (a, b) group, wrapped in a list column
ids <- dt[, .(list(.I)), by = .(a, b)]

# compute each group's result on a worker and assign the combined
# vector back into the rows given by the unlisted indices
dt[unlist(ids$V1), y := foreach(i = ids$V1, .combine = c, .export = "dt", .packages = "data.table") %dopar% {
  setDT(dt)[i, as.numeric(scale(x))]
}]

stopCluster(workers); registerDoSEQ(); rm(workers)

# sanity check
dt[, identical(y, as.numeric(scale(x))), by = .(a, b)]
    a b   V1
 1: 2 c TRUE
 2: 1 a TRUE
 3: 3 d TRUE
 4: 1 d TRUE
 5: 1 b TRUE
 6: 3 c TRUE
 7: 2 e TRUE
 8: 3 e TRUE
 9: 2 a TRUE
10: 2 b TRUE
11: 2 d TRUE
12: 1 c TRUE
13: 3 a TRUE
14: 3 b TRUE
15: 1 e TRUE

We first get the row indices for each group and save them in ids (in a list so that they can be directly passed to foreach). The line that assigns y passes the unlisted indices to data.table's i so that the result of foreach is assigned to the appropriate rows.

We use setDT inside the foreach code because the table is serialized to the workers, so the address in memory changes (at least I think so, maybe someone else can confirm).

Definitely benchmark it with your actual function; using foreach is no guarantee of a speedup. Given the serialization, copying the data to the workers might be too much overhead, relatively speaking.
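For instance, a rough comparison against a plain grouped := could look like the following (run it while the workers from above are still registered, i.e. before stopCluster, and put your real function in place of scale):

# serial baseline: same computation using data.table's own grouping
system.time(dt[, y_serial := as.numeric(scale(x)), by = .(a, b)])

# parallel version from above
system.time(
  dt[unlist(ids$V1), y := foreach(i = ids$V1, .combine = c,
                                  .export = "dt", .packages = "data.table") %dopar% {
    setDT(dt)[i, as.numeric(scale(x))]
  }]
)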