417
votes

When I need to filter a data.frame, i.e., extract rows that meet certain conditions, I prefer to use the subset function:

subset(airquality, Month == 8 & Temp > 90)

Rather than the [ function:

airquality[airquality$Month == 8 & airquality$Temp > 90, ]

There are two main reasons for my preference:

  1. I find the code reads better, from left to right. Even people who know nothing about R could tell what the subset statement above is doing.

  2. Because columns can be referred to as variables in the subset expression, I can save a few keystrokes. In my example above, I only had to type airquality once with subset, but three times with [.

So I was living happily, using subset everywhere because it is shorter and reads better, even advocating its beauty to my fellow R coders. But yesterday my world broke apart. While reading the subset documentation, I noticed this section:

Warning

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.

Could someone help clarify what the authors mean?

First, what do they mean by "for use interactively"? I know what an interactive session is, as opposed to a script run in BATCH mode, but I don't see what difference it should make.

Then, could you please explain "the non-standard evaluation of argument subset" and why it is dangerous, maybe provide an example?

2
It is slightly less typing (but not less than subset) to use with: with(airquality, airquality[Month == 8 & Temp > 90, ]) - Tyler Rinker
You might also have a look at Circles 8.2.31 and 8.2.32 of 'The R Inferno': burns-stat.com/pages/Tutor/R_inferno.pdf - Patrick Burns
Try data.table instead; the default syntax is like airquality[Month == 8 & Temp > 90,], which is very readable and much faster. - Stian Håklev
OK, so if subset is bad to use, what about [ vs. dplyr::filter()? - userJT
For those wondering, dplyr::filter has the same problem, i.e., if the environment happens to have a variable with that name, it will use it instead of the variable in the data frame. Makes for confusing debugging! - CoderGuy123

2 Answers

252
votes

This question was answered well in the comments by @James, who pointed to an excellent explanation by Hadley Wickham of the dangers of subset (and functions like it) [here]. Go read it!

It's a somewhat long read, so it may be helpful to record here the example that Hadley uses that most directly addresses the question of "what can go wrong?":

Suppose we want to subset and then reorder a data frame using the following functions:

scramble <- function(x) x[sample(nrow(x)), ]

subscramble <- function(x, condition) {
  scramble(subset(x, condition))
}

subscramble(mtcars, cyl == 4)

This returns the error:

Error in eval(expr, envir, enclos) : object 'cyl' not found

because R no longer "knows" where to find the object called 'cyl'. He also points out the truly bizarre stuff that can happen if by chance there is an object called 'cyl' in the global environment:

cyl <- 4
subscramble(mtcars, cyl == 4)

cyl <- sample(10, 100, replace = TRUE)
subscramble(mtcars, cyl == 4)

(Run them and see for yourself, it's pretty crazy.)
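To make that concrete, here is a minimal sketch (base R only) of what happens when a global cyl exists, followed by one possible programming-safe variant. Note that subscramble_safe and its keep argument are hypothetical names of my own, not something from Hadley's write-up; the idea is simply that the caller passes an ordinary logical vector, so nothing depends on non-standard evaluation.

```r
# Helpers from the example above, repeated so this block runs on its own.
scramble <- function(x) x[sample(nrow(x)), ]
subscramble <- function(x, condition) scramble(subset(x, condition))

# With a global `cyl`, the condition is evaluated against that variable,
# not against mtcars$cyl. Here `cyl == 4` is a single TRUE, which gets
# recycled, so every row is kept without any error or warning.
cyl <- 4
all_rows <- subscramble(mtcars, cyl == 4)
nrow(all_rows)  # 32, i.e. nothing was filtered

# Hypothetical safe variant: take a plain logical vector and index with `[`.
subscramble_safe <- function(x, keep) {
  stopifnot(is.logical(keep), length(keep) == nrow(x))
  # Treat NA as "do not keep", which is what subset() does as well.
  scramble(x[keep & !is.na(keep), , drop = FALSE])
}

res <- subscramble_safe(mtcars, mtcars$cyl == 4)
nrow(res)  # 11, the four-cylinder cars only
```

The length check is a design choice rather than a requirement: it turns the silent recycling seen above into an immediate error, which is usually what you want inside a function.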

32
votes

Also [ is faster:

library(microbenchmark)
microbenchmark(
  subset(airquality, Month == 8 & Temp > 90),
  airquality[airquality$Month == 8 & airquality$Temp > 90, ]
)
    Unit: microseconds
                                                           expr     min       lq   median       uq     max neval
                     subset(airquality, Month == 8 & Temp > 90) 301.994 312.1565 317.3600 349.4170 500.903   100
     airquality[airquality$Month == 8 & airquality$Temp > 90, ] 234.807 239.3125 244.2715 271.7885 340.058   100