3
votes

dplyr is the only package that can handle my 843k data.frame and query it in a fast way. I can filter fine using some math and equal criteria, however I need to implement a search for a concept.

I need something like this sqldf query

library(sqldf)
head(iris)
sqldf("select * from iris where lower(Species) like '%nica%'")

In dplyr help I was not able to find how I could do it. something like:

filter(iris,Species like '%something%')

The starting and ending % is very important. Also, note that the data frame has 800+k rows so traditional R functions may run slow. It has to bee a dplyr based solution.

2
dplyr is the only package that can handle my 843k data.frame - I highly suggest the R package data.table.nrussell
I made a decision to focus on dplyr syntax and "skip data.table" My whole app is commited to dplyr so data.table solution is OK but native dplyr is best. I think dplyr IS using data.table under the hood.userJT
dplyr can work with data tables, but unless you explicitly load data.table and convert your data.frames to data.tables, it won't use it under the hood. Doing so might give you additional speed gains.Gregor Thomas

2 Answers

6
votes

What about this -

library(dplyr)
data(iris)
filter(iris, grepl("nica",Species))

EDIT: Another option - the function %like% in data.table()

library(dplyr)
data(iris)
##
Iris <- iris[
  rep(seq_len(nrow(iris)),each=5000),
  ]
dim(Iris)
[1] 750000      5
##
library(microbenchmark)
library(data.table)
##
Dt <- data.table(Iris)
setkeyv(Dt,cols="Species")
##
foo <- function(){
  subI <- filter(Iris, grepl("nica",Species))
}
##
foo2 <- function(){
  subI <- Dt[Species %like% "nica"]
}
##
foo3 <- function(){
  subI <- filter(Iris, Species %like% "nica")
}
Res <- microbenchmark(
  foo(),foo2(),foo3(),
  times=100L)
##
> Res
Unit: milliseconds
   expr       min        lq    median        uq      max neval
  foo() 114.31080 122.12303 131.15523 136.33254 214.0405   100
 foo2()  23.00508  30.33685  39.77843  41.49121 129.9125   100
 foo3()  18.84933  22.47958  29.39228  35.96649 114.4389   100
3
votes

full code would be (including the lowercase)

require(data.table)
iris %>% filter(tolower(Species) %like% 'nica')