40
votes

I have two data frames(df and df1). df1 is subset of df. I want to get a data frame which is complement of df1 in df, i.e. return rows of the first data set which are not matched in the second. For example let,

data frame df:

heads
row1
row2
row3
row4
row5

data frame df1:

heads
row3
row5

Then the desired output df2 is:

heads
row1
row2
row4
7

7 Answers

38
votes

Try anti_join from dplyr

library(dplyr)
anti_join(df, df1, by='heads')
70
votes

You could also do some type of anti join with data.tables binary join

library(data.table)
setkey(setDT(df), heads)[!df1]
#    heads
# 1:  row1
# 2:  row2
# 3:  row4

EDIT: Starting data.table v1.9.6+ we can join data.tables without setting keys while using on

setDT(df)[!df1, on = "heads"]

EDIT2: Starting data.table v1.9.8+ fsetdiff was introduced which is basically a variation of the solution above, just over all the column names of the x data.table, e.g. x[!y, on = names(x)]. If all set to FALSE (the default behavior), then only unique rows in x will be returned. For the case of only one column in each data.table the following will be equivalent to the previous solutions

fsetdiff(df, df1, all = TRUE)
25
votes

Try the %in% command and reverse it with !

df[!df$heads %in% df1$heads,]
9
votes

Another option, using base R and the setdiff function:

df2 <- data.frame(heads = setdiff(df$heads, df1$heads))

setdiff functions exactly as you would imagine; take both arguments as sets, and remove all items in the second from the first.

I find setdiff more readable tahtn %in% and prefer not to require additional libraries when I don't need them, but which answer you use is largely a question of personal taste.

5
votes

dplyr also has setdiff() which will get you the

enter image description here

setdiff(bigFrame, smallFrame) gets you the extra records in the first table.

so for the OP's example the code would read setdiff(df, df1)

dplyr has a lot of great functionality: for a quick easy guide see here.

4
votes

Late answer, but for another option we can try doing a formal SQL anti join, using the sqldf package:

library(sqldf)
sql <- "SELECT t1.heads
        FROM df t1 LEFT JOIN df1 t2
            ON t1.heads = t2.heads
        WHERE t2.heads IS NULL"
df2 <- sqldf(sql)

The sqldf package can be useful for those problems which are easily phrased using SQL logic, but perhaps less easily phrased using base R or another R package.

3
votes

Another option by creating a function negate_match_df by manipulating the code of match_df of plyr package.

library(plyr)
negate_match_df <- function (x, y, on = NULL) 
{
if (is.null(on)) {
    on <- intersect(names(x), names(y))
    message("Matching on: ", paste(on, collapse = ", "))
}
keys <- join.keys(x, y, on)
x[!keys$x %in% keys$y, , drop = FALSE]
}

Data

df <- read.table(text ="heads
row1
row2
row3
row4
row5",header=TRUE)

df1 <- read.table(text ="heads
row3
row5",header=TRUE)

Output

negate_match_df(df,df1)