4 votes

I have a problem where I need to select and save part of a table based on one column, then eliminate from the source table the rows that match values in one of the columns of the saved part.
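
For concreteness, here is a single iteration of that select-then-eliminate step on toy data:

# toy data, just to illustrate one iteration of the procedure
df <- data.frame(x = c(1, 2, 3, 2), y = c(10, 10, 20, 20))
saved <- df[df$y == 10, ]          # select and save the part where y == 10
df <- df[!df$x %in% saved$x, ]     # eliminate source rows whose x appears in the saved part
df
#   x  y
# 3 3 20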

I've found that dplyr and data.table are slower than base R and am wondering if I'm doing something wrong here (an anti-pattern I'm not aware of?) or if someone knows a faster solution to this.

I need to scale it up to ~10 million rows in the search df and ~10k iterations of the y_unique search.

Here's a reasonable reproducible example...

(Edit: I realized that what I was doing could be achieved through a group filter. I'm leaving an updated reproducible example with some tweaks from the comments below, plus my updated solution, the alternate expression in the benchmark. Note that the original did not include the bind_rows(y_list) detail; in retrospect, I should have included that in this example.)

library(dplyr)
library(data.table)
library(microbenchmark)

microbenchmark(base = {
  for(y_check in y_unique) {
    # save this y's rows, then drop every row sharing an x with the saved part
    y_list[[as.character(y_check)]] <- df[df$y == y_check, ]
    df <- df[!df$x %in% y_list[[as.character(y_check)]]$x, ]
  }
  out <- bind_rows(y_list)
}, dplyr = {
  for(y_check in y_unique) {
    y_list[[as.character(y_check)]] <- filter(df, y == y_check)
    df <- df[!df$x %in% y_list[[as.character(y_check)]]$x, ]
  }
  out <- bind_rows(y_list)
}, data.table = {
  for(y_check in y_unique) {
    y_list[[as.character(y_check)]] <- dt[y == y_check]
    dt <- dt[!x %in% y_list[[as.character(y_check)]]$x]
  }
  out <- do.call(rbind, y_list)
}, alternate = {
  # updated solution: a group filter keeping each x only at its smallest y
  df <- group_by(df, x)
  out <- filter(df, y == min(y))
}, times = 10, setup = {
  # setup recreates df/dt fresh for each run since the loops overwrite them
  set.seed(1)
  df <- data.frame(x = sample(1:1000, size = 1000, replace = TRUE),
                   y = sample(1:100, size = 1000, replace = TRUE))
  dt <- data.table(df)
  y_unique <- sort(unique(df$y))
  y_list <- setNames(rep(list(list()), length(y_unique)), y_unique)
})

I get:

Unit: milliseconds
       expr        min        lq       mean     median        uq        max neval
       base  12.939135  13.22883  13.623098  13.500897  13.95468  14.517167    10
      dplyr  41.517351  42.22595  50.041123  45.199978  61.33194  65.927611    10
 data.table 228.014360 233.98309 248.281965 240.172383 263.39943 287.706941    10
  alternate   3.310031   3.42016   3.745013   3.454537   4.17488   4.497455    10

On my real data I get more or less the same: base R is 2+ times faster than dplyr, and data.table is... slow. Any ideas?

If you're worried about speed, I would strongly recommend testing at scale, or at least closer to scale. In many cases data.table is slower at small scale (when the difference is in milliseconds) and faster at large scale (when the difference is in seconds or minutes---i.e., where it counts). - Gregor Thomas
data.table can also be context specific. Much of the data.table time in this benchmark is probably tied up in the data.table conversion. But if you're doing more than this one transformation, converting to data.table will speed up all your operations. (And if your data is read in from a file, using data.table::fread will produce a data.table more quickly than any other import method will produce a data frame, so holistically it may not make sense to include data.table() in the benchmark.) - Gregor Thomas
Also, see ?split and ?split.data.table, and ?dplyr::group_split. This question is highly relevant to you. - Gregor Thomas
Hmmm.. perhaps I've misunderstood the iterative nature? Do you expect the filtered df to have anything in it at the end? Or is the point filtering it for the next iteration? - Gregor Thomas
Adding to what @Gregor said, coercion will be an overhead not only during data.frame to data.table conversion, but also during searching, because you're doing as.character(y_unique) and then using it to search integers. You should leave it as integer and use y_list[[as.character(y_check)]]. Moreover, data.table is probably not made for this kind of scenario; check this question and its answer. - Alexis
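
A minimal sketch of the split idea from Gregor's comments, assuming the same df/dt as in the question (note this only splits by y in one pass; it does not replicate the loop's iterative elimination of already-seen x values, which the answer below handles with joins):

y_list <- split(dt, by = "y", sorted = TRUE)   # split.data.table, one pass
# or, with dplyr:
# y_list <- dplyr::group_split(df, y)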

1 Answer

1 vote

A few options using joins (around 13s with any of the join methods at the actual dimensions):

DT <- copy(dt)
setorder(DT, y, x)  # sort so that first(y) within each x group is its smallest y
# for each unique x, find its first (smallest) y, then join back on (x, y)
DT[DT[.(unique(x)), on=.(x), .(y=first(y)), by=.EACHI], on=.(x,y)]
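
To unpack the .EACHI idiom on a small made-up example (not part of the benchmark below): after sorting by y then x, first(y) within each x group is that x's smallest y, and joining back on (x, y) keeps exactly those rows.

toy <- data.table(x = c(1L, 2L, 1L, 3L, 2L), y = c(5L, 2L, 1L, 4L, 3L))
setorder(toy, y, x)
toy[.(unique(x)), on=.(x), .(y=first(y)), by=.EACHI]
#    x y
# 1: 1 1
# 2: 2 2
# 3: 3 4
toy[toy[.(unique(x)), on=.(x), .(y=first(y)), by=.EACHI], on=.(x,y)]
#    x y
# 1: 1 1
# 2: 2 2
# 3: 3 4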

or if original ordering matters:

DT2 <- copy(dt)
setorder(DT2[, rn := .I], y, x)  # tag original row numbers, then sort
# take each x's smallest-y row numbers; sort() restores the original row order
dt[sort(DT2[.(unique(x)), on=.(x), rn[y==first(y)], by=.EACHI]$V1)]

and also using the min mentioned in the OP (min(y) does not depend on row order, so the pre-sort can be skipped):

DT0[, rn := .I]  # tag original row numbers; no pre-sort needed with min()
dt[DT0[.(unique(x)), on=.(x), rn[y==min(y)], by=.EACHI][order(V1), V1]]

timing code:

base <- function() {
    for(y_check in y_unique) {
        y_list[[as.character(y_check)]] <- df[df$y == y_check, ]
        df <- df[!df$x %in% y_list[[as.character(y_check)]]$x, ]
    }
    do.call(rbind, y_list)
} #base

mtd0 <- function() {
    for(y_check in y_unique) {
        y_list[[as.character(y_check)]] <- dt[y == y_check]
        dt <- dt[!x %in% y_list[[as.character(y_check)]]$x]
    }
    out <- rbindlist(y_list)
} #mtd0

join_mtd <- function() {
    setorder(DT, y, x)
    dt[DT[.(unique(x)), on=.(x), .(y=first(y)), by=.EACHI], on=.(x,y)]
} #join_mtd

join_mtd2 <- function() {
    setorder(DT2[, rn := .I], y, x)
    dt[sort(DT2[.(unique(x)), on=.(x), rn[y==first(y)], by=.EACHI]$V1)]
} #join_mtd2

join_mtd3 <- function() {
    DT0[, rn := .I]
    dt[DT0[.(unique(x)), on=.(x), rn[y==min(y)], by=.EACHI][order(V1), V1]]
} #join_mtd3

bench::mark(base(), data.table_0=mtd0(), 
    jm=join_mtd(), jm2=join_mtd2(), jm3=join_mtd3(), check=FALSE)

checks:

baseans <- setDT(base())
data.table_0 <- mtd0()
ordbase <- setorder(copy(baseans), y, x)
jm <- join_mtd()
jm2 <- join_mtd2()
jm3 <- join_mtd3()

identical(baseans, data.table_0)
#[1] TRUE
identical(ordbase, setorder(jm, y, x))
#[1] TRUE
identical(ordbase, setorder(jm2, y, x))
#[1] TRUE
identical(ordbase, setorder(jm3, y, x))
#[1] TRUE

timings:

# A tibble: 5 x 14
  expression        min     mean   median      max `itr/sec` mem_alloc  n_gc n_itr total_time result                   memory                time    gc            
  <chr>        <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list>                   <list>                <list>  <list>        
1 base()         38.59s   38.59s   38.59s   38.59s    0.0259    27.3GB   308     1     38.59s <data.frame [632,329 x ~ <Rprofmem [43,206 x ~ <bch:t~ <tibble [1 x ~
2 data.table_0   24.65s   24.65s   24.65s   24.65s    0.0406      14GB   159     1     24.65s <data.table [632,329 x ~ <Rprofmem [72,459 x ~ <bch:t~ <tibble [1 x ~
3 jm              1.28s    1.28s    1.28s    1.28s    0.779       75MB     7     1      1.28s <data.table [632,329 x ~ <Rprofmem [2,418 x 3~ <bch:t~ <tibble [1 x ~
4 jm2             1.44s    1.44s    1.44s    1.44s    0.696     62.5MB     9     1      1.44s <data.table [632,329 x ~ <Rprofmem [1,783 x 3~ <bch:t~ <tibble [1 x ~
5 jm3             1.57s    1.57s    1.57s    1.57s    0.636     62.5MB     9     1      1.57s <data.table [632,329 x ~ <Rprofmem [178 x 3]>  <bch:t~ <tibble [1 x ~

data:

library(data.table)
library(bench)

set.seed(1L)
nr <- 10e6/10  # question's ~10M rows, scaled down 10x for the benchmark
ni <- 10e3/10  # question's ~10k distinct y values, scaled down 10x
df <- data.frame(x = sample(nr, size = nr, replace = TRUE),
    y = sample(ni, size = nr, replace = TRUE))
dt <- data.table(df)
DT0 <- copy(dt)
DT <- copy(dt)
DT2 <- copy(dt)

y_unique <- sort(unique(df$y))
y_list <- setNames(rep(list(list()), length(y_unique)), y_unique)
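
(To reproduce the roughly 13s full-scale timings mentioned at the top of the answer, drop the 10x scaling and use nr <- 10e6 and ni <- 10e3 instead.)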