R data.table efficient replication by group

Question

I am running into some memory allocation problems trying to replicate some data by groups using data.table and rep.

Here is some sample data:

ob1 <- as.data.frame(cbind(c(1999),c("THE","BLACK","DOG","JUMPED","OVER","RED","FENCE"),c(4)),stringsAsFactors=FALSE)
ob2 <- as.data.frame(cbind(c(2000),c("I","WALKED","THE","BLACK","DOG"),c(3)),stringsAsFactors=FALSE)
ob3 <- as.data.frame(cbind(c(2001),c("SHE","PAINTED","THE","RED","FENCE"),c(1)),stringsAsFactors=FALSE)
ob4 <- as.data.frame(cbind(c(2002),c("THE","YELLOW","HOUSE","HAS","BLACK","DOG","AND","RED","FENCE"),c(2)),stringsAsFactors=FALSE)
sample_data <- rbind(ob1,ob2,ob3,ob4)
colnames(sample_data) <- c("yr","token","multiple")

What I am trying to do is replicate the tokens (in the present order) by the multiple for each year.
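To make the intended ordering concrete, here is a minimal base R sketch for a single hypothetical year (the two tokens and the multiple of 3 are made up for illustration). It assumes that "in the present order" means the whole token sequence is repeated, which is what `times` does, rather than each token being repeated in place, which is what `each` does.

```r
# Hypothetical one-year example: two tokens, multiple = 3
tokens <- c("I", "WALKED")
multiple <- 3

# Whole sequence repeated, original order kept within each repetition
wanted <- rep(tokens, times = multiple)

# Each token repeated in place (not the desired layout)
not_wanted <- rep(tokens, each = multiple)

print(wanted)
print(not_wanted)
```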

The following code works and gives me the answer I want:

library(plyr)
good_solution1 <- ddply(sample_data, "yr", function(x) data.frame(rep(x[,2],x[1,3])))

library(data.table)
good_solution2 <- data.table(sample_data)[, rep(token,unique(multiple)),by = "yr"]

The issue is that when I scale this up to 40 million+ rows, I run into memory issues with both solutions.

If my understanding is correct, these solutions are essentially doing an rbind, which allocates every time.

Does anyone have a better solution?

I looked at set() for data.table but was running into issues because I wanted to keep the tokens in the same order for each replication.

Since version 1.9.2 (on CRAN 27 Feb 2014), data.table has had the function setDT(), which takes a list or data.frame and converts it to a data.table by reference, without any copy. So using setDT(sample_data) instead of data.table(sample_data) may help save memory. (Comment by Uwe)
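A minimal sketch of the setDT() suggestion, assuming a small made-up data.frame (the values below are hypothetical, not the question's data). It only illustrates the in-place conversion followed by a grouped rep(); whether it relieves memory pressure at 40 million rows is not something this example can show.

```r
library(data.table)

# Hypothetical small input; "multiple" is numeric here for simplicity
df <- data.frame(yr = c(1999, 1999, 2000),
                 token = c("THE", "DOG", "I"),
                 multiple = c(2, 2, 1),
                 stringsAsFactors = FALSE)

# Convert in place: df itself becomes a data.table, no copy is made
setDT(df)

# Repeat each year's token sequence "multiple" times, keeping token order
res <- df[, .(token = rep(token, multiple[1L])), by = yr]
print(res)
```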

Arun · Accepted Answer · 2013-03-24T01:18:26

One way is:

require(data.table)
dt <- data.table(sample_data)
# multiple seems to be a character, convert to numeric
dt[, multiple := as.numeric(multiple)]
setkey(dt, "multiple")
dt[J(rep(unique(multiple), unique(multiple))), allow.cartesian=TRUE]

Everything except the last line should be straightforward. The last line subsets on the key column with the help of J(.): each value in J(.) is matched against the key column, and the matching subset is returned.

That is, if you do dt[J(1)] you'll get the subset where multiple == 1. And if you look carefully, dt[J(rep(1,2))] gives you the same subset, but twice. Note that there's a difference between dt[J(1,1)] and dt[J(rep(1,2))]. The former matches the values (1,1) against the first two key columns of the data.table respectively, whereas the latter matches the values 1 and 1, one after the other, against the first key column of the data.table.
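The difference between a single lookup value and a repeated one can be sketched on a tiny keyed table. The table contents are made up for illustration, and allow.cartesian = TRUE is passed defensively even though a join this small may not require it.

```r
library(data.table)

# Hypothetical keyed table: two tokens with multiple 1, two with multiple 2
dt <- data.table(multiple = c(1, 1, 2, 2),
                 token = c("RED", "FENCE", "BLACK", "DOG"))
setkey(dt, multiple)

# One lookup value: the subset where multiple == 1, returned once
once <- dt[J(1)]

# The same lookup value twice: the same subset, returned twice in order
twice <- dt[J(rep(1, 2)), allow.cartesian = TRUE]

print(once)
print(twice)
```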

So, if we pass the same key value twice in J(.), the matching subset is returned twice. We use this trick to pass 1 once, 2 twice, and so on, which is what the rep(.) part does: rep(.) gives 1,2,2,3,3,3,4,4,4,4.
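The lookup vector itself can be checked in base R. This sketch assumes the unique multiples are 1 to 4, as in the sample data.

```r
# Unique values of the key column, as in the sample data
idx <- c(1, 2, 3, 4)

# Each value repeated as many times as its own magnitude
lookup <- rep(idx, times = idx)
print(lookup)

# How many times each key value will be looked up in the join
print(table(lookup))
```

Note that the trick relies on the key value identifying the group. In the sample data every year has its own multiple; if two years shared a multiple, each lookup would return both years' rows together.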

And if the join results in more rows than max(nrow(dt), nrow(i)) (i is the rep vector inside J(.)), you have to explicitly use allow.cartesian = TRUE to perform this join (I guess this is a new feature from data.table 1.8.8).
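A small sketch of the allow.cartesian guard, using a made-up table. The exact row threshold may differ between data.table versions, so the example is sized to produce far more rows than either table has, and the failing call is wrapped in tryCatch() so the script keeps running.

```r
library(data.table)

# Hypothetical table: ten rows that all share the key value 1
x <- data.table(k = rep(1, 10), v = 1:10, key = "k")

# Five identical lookup values, each matching all ten rows (50 rows in total)
i_vals <- rep(1, 5)

# Without allow.cartesian the join is expected to stop with an error
err <- tryCatch(x[J(i_vals)], error = function(e) e)

# With allow.cartesian = TRUE the large join is explicitly permitted
ok <- x[J(i_vals), allow.cartesian = TRUE]

print(class(err))
print(nrow(ok))
```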

Edit: Here's some benchmarking I did on a relatively big data set. I don't see any spike in memory allocations in either method. But I've yet to find a way to monitor peak memory usage within a function in R. I am sure I've seen such a post here on SO, but it escapes me at the moment. I'll write back again. For now, here's some test data and some preliminary results in case anyone is interested or wants to run it for themselves.

# dummy data
set.seed(45)
yr <- 1900:2013
sz <- sample(10:50, length(yr), replace = TRUE)
token <- unlist(sapply(sz, function(x) do.call(paste0, data.frame(matrix(sample(letters, x*4, replace=TRUE), ncol=4)))))
multiple <- rep(sample(500:5000, length(yr), replace=TRUE), sz)

DF <- data.frame(yr = rep(yr, sz), 
                 token = token, 
                 multiple = multiple, stringsAsFactors=FALSE)

# Arun's solution
ARUN.DT <- function(dt) {
    setkey(dt, "multiple")
    idx <- unique(dt$multiple)
    dt[J(rep(idx,idx)), allow.cartesian=TRUE]
}

# Ricardo's solution
RICARDO.DT <- function(dt) {
    setkey(dt, "yr")
    newDT <- setkey(dt[, rep(NA, list(rows=length(token) * unique(multiple))), by=yr][, list(yr)], 'yr')
    newDT[, tokenReps := as.character(NA)]

    # Add the rep'd tokens into newDT, using recycling
    newDT[, tokenReps := dt[.(y)][, token], by=list(y=yr)]
    newDT
}

# create data.table
require(data.table)
DT <- data.table(DF)

# benchmark both versions
require(rbenchmark)
benchmark(res1 <- ARUN.DT(DT), res2 <- RICARDO.DT(DT), replications=10, order="elapsed")

#                     test replications elapsed relative user.self sys.self
# 1    res1 <- ARUN.DT(DT)           10   9.542    1.000     7.218    1.394
# 2 res2 <- RICARDO.DT(DT)           10  17.484    1.832    14.270    2.888

But as Ricardo says, speed may not matter if you run out of memory. So, in that case, there has to be a trade-off between speed and memory. What I'd like to verify is the peak memory used by both methods here, to say definitively whether using the join is better.
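One base R way to approximate peak memory, offered as a sketch rather than a definitive measurement: gc(reset = TRUE) resets the "max used" statistics, and a later gc() call reports the maximum reached since then. The helper name peak_mb is hypothetical, and the figure is only as precise as R's own bookkeeping at garbage-collection time.

```r
# Hypothetical helper: approximate peak memory (in Mb) used while evaluating expr
peak_mb <- function(expr) {
  before <- gc(reset = TRUE)        # reset the "max used" statistics
  base_mb <- sum(before[, 2L])      # Mb in use right now (Ncells + Vcells)
  force(expr)                       # evaluate the lazily passed expression
  after <- gc()
  # the "(Mb)" column that directly follows the "max used" column
  max_col <- which(colnames(after) == "max used") + 1L
  sum(after[, max_col]) - base_mb
}

# Example: allocating ten million doubles needs roughly 76 Mb
used <- peak_mb(numeric(1e7))
print(used)
```

Under these assumptions, each method could be wrapped the same way, for example peak_mb(ARUN.DT(DT)) and peak_mb(RICARDO.DT(DT)), to compare approximate peaks.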

2 Answers

Two notes: