
I have a complex block of dplyr code that has worked successfully on a data frame containing 5,200,000 rows. Since I wrote the code I have updated my R version from 3.1.2 to 3.2.0, and I am currently using Revolution R Open (RRO) 3.2.0.

Running the code block now on the same data causes RStudio to error with

fatal error - R Session Aborted

The error occurs under both RRO 3.2.0 and standard R 3.2.0.

I am also not sure whether the window functions used (lag and row_number) are the culprits. I am mostly interested in finding out what is causing the statement to crash R/RStudio, as opposed to rewriting the dplyr statement, but I am happy to receive tips on better dplyr practices :-)
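To narrow down the suspicion, here is a minimal debugging sketch of my own (not part of the original run) that applies only the two suspected window functions to the sample DF constructed below:

library(dplyr)

# Sketch: run ONLY the suspected window functions on the grouped data.
# If this alone aborts the R session, lag()/row_number() over ~2.6M
# groups is the likely trigger; if it completes, the later mutate and
# regrouping steps are more suspect.
win_only <- tbl_df(DF) %>%
    group_by(ID) %>%
    mutate(event_lag = lag(EVENT),
           row_rank  = row_number())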

I have looked at the following questions on Stack Overflow: "dplyr crash when using lagged difference computation" and "dplyr crashes when using summarise with segfault error", but I don't feel they relate to my query.

I can successfully run the dplyr statement on half the data using the slice operator, and it runs equally successfully on the other half of the data (see the sketch after the full statement below), so I don't believe it is an issue with the data.

I have been able to replicate the error on a data frame with sample data.

This is the code to generate the sample data frame DF:

library(dplyr)
# create an ID column containing some duplicate values
set.seed(1)
DF <- data.frame(ID = floor(runif(5200000, 1,3000000)))

# Order data frame by ID, YEAR
DF <- tbl_df(DF) %>%
    group_by(ID) %>%
    mutate(YEAR = row_number()) %>%
    arrange(ID, YEAR)

# create an EVENT variable set to 0 80% of the time, 1 10% of the time, etc.
DF$EVENT <- sample(0:5, 5200000, replace = TRUE, prob = c(0.8, 0.1, 0.05, 0.025, 0.015, 0.01))

# create a vector of unique IDs
unique_IDs <- unique(DF$ID)
# take a 10% sample of the unique IDs
init_set <- sample(unique_IDs, replace = FALSE, size = round(length(unique_IDs) * 0.1))
# create an index of the 10% sample IDs
init.idx <- DF$ID %in% init_set

# create an initialisation state variable with Y and N values
DF$INIT_STATE <- as.factor(ifelse(init.idx,"Y","N"))
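
As a quick sanity check on the generated sample (optional; the proportions are approximate and will vary slightly with the random draws):

nrow(DF)                               # 5,200,000 rows
length(unique(DF$ID))                  # roughly 2.6 million unique IDs
round(prop.table(table(DF$EVENT)), 3)  # ~0.8 zeros, ~0.1 ones, etc.
table(DF$INIT_STATE)                   # roughly 10% of rows flagged "Y"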

The dplyr statement I am running looks as follows:

tbl_df(DF) %>%     
    select(ID, YEAR, EVENT, INIT_STATE) %>%
    # slice(1:2600000) %>%
    group_by(ID) %>%                                                    # group by ID to control window functions
    arrange(ID, YEAR) %>%                                               # sort by ID, YEAR (just to be sure, may not be needed)
    mutate(event_lag = lag(EVENT)                                       # shift EVENT by a lag of 1 (so YEAR_COUNTER resets to zero in the year after an event)
           , event_lag = ifelse(is.na(event_lag), 0, event_lag) ) %>%   # the first lag in each ID group is NA; this sets it to 0
    mutate(i = cumsum(ifelse(event_lag, 1, 0))) %>%                     # cumulative count of lagged events (used for grouping)
    group_by(i, add = TRUE) %>%                                         # add i as a second grouping variable
    mutate(row_rank = row_number()                                      # row_number() is a counter that restarts in every (ID, i) group
           , year_ini = ifelse(i == 0 & INIT_STATE == "N", 5, 0)        # determines whether YEAR_COUNTER starts at 5 yrs or 0 yrs
           , YEAR_COUNTER = year_ini + row_rank - 1) %>%                # YEAR_COUNTER = initialisation value + row counter; -1 starts the counter from 0
    select(-(event_lag:year_ini))  

I have added comments on each line of the dplyr statement to indicate what each step is intended to do.
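For reference, the half-data runs mentioned above look like this. run_block() is my own wrapper (not from the original post); it simply reuses the statement above with slice() enabled so each half can be run independently:

library(dplyr)

run_block <- function(df, rows) {
    tbl_df(df) %>%
        select(ID, YEAR, EVENT, INIT_STATE) %>%
        slice(rows) %>%
        group_by(ID) %>%
        arrange(ID, YEAR) %>%
        mutate(event_lag = lag(EVENT),
               event_lag = ifelse(is.na(event_lag), 0, event_lag)) %>%
        mutate(i = cumsum(ifelse(event_lag, 1, 0))) %>%
        group_by(i, add = TRUE) %>%
        mutate(row_rank = row_number(),
               year_ini = ifelse(i == 0 & INIT_STATE == "N", 5, 0),
               YEAR_COUNTER = year_ini + row_rank - 1) %>%
        select(-(event_lag:year_ini))
}

half_1 <- run_block(DF, 1:2600000)          # completes successfully
half_2 <- run_block(DF, 2600001:5200000)    # completes successfully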

A successful run on half the data looks as follows:

Source: local data frame [2,600,000 x 6]
Groups: ID, i

   i ID YEAR EVENT INIT_STATE YEAR_COUNTER
1  0  1    1     1          N            5
2  1  1    2     0          N            0
3  1  1    3     0          N            1
4  1  1    4     0          N            2
5  1  1    5     0          N            3
6  0  2    1     0          N            5
7  0  3    1     0          N            5
8  0  3    2     0          N            6
9  0  3    3     2          N            7
10 0  4    1     0          N            5
.. . ..  ...   ...        ...          ...

In addition to the session info below, I have 192 GB of RAM on the server, and I don't see any significant spikes in memory usage while the dplyr statement runs.
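One way to confirm the peak from inside the R session on Windows (a sketch of how this could be checked; the observation above was made by watching the server's memory externally):

gc(reset = TRUE)            # reset the "max used" counters
# ... run the dplyr statement here ...
gc()                        # the "max used" columns show the peak since the reset
memory.size(max = TRUE)     # peak memory (MB) obtained from the OS (Windows only)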

sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.4.1

loaded via a namespace (and not attached):
[1] lazyeval_0.1.10 magrittr_1.5    assertthat_0.1  parallel_3.2.0  DBI_0.3.1       tools_3.2.0     Rcpp_0.11.6 

1 Answer


I have just updated to RRO 3.2.1 and dplyr 0.4.2, and this appears to have solved the problem, which is great news. Thanks to everyone who looked at the question.
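For anyone else hitting this, a minimal check to confirm the versions in the updated session (nothing specific to this crash):

R.version.string            # e.g. "R version 3.2.1 ..."
packageVersion("dplyr")     # should report 0.4.2 or later
sessionInfo()               # full picture, as posted in the question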