Subset data frame based on unique combination of multiple conditions

Question

I cant seem to find an answer through search to this on SO. I'm trying to select a subset of a data.frame based on four conditions (lon1, lon2, lat1 and lat2). I have a huge dissimilarity matrix that has been vectorized and the sites (lon1, lon2, lat1 and lat2) cbind to it. Here is an example data frame:

out1 <- data.frame(lon1 = sample(1:10), lon2 = sample(1:10), 
                   lat1 = sample(1:10), lat2 = sample(1:10), 
                   dissimilarity = sample(seq(0,1,.1),10))
> out1
     lon1   lon2    lat1 lat2 dissimilarity
1     2      6      4      4           0.6
2     4      2      1      3           1.0
3    10      9      2      6           0.0
4     3      1     10      8           0.5
5     9      5      9      1           0.8
6     5      7      5      9           0.9
7     1      8      6      7           0.2
8     8      3      8      5           0.7
9     7      4      3     10           0.3
10    6     10      7      2           0.1

out2 <- out1[c(2,5,6,8),]

   lon1 lon2 lat1 lat2 dissimilarity
1     4   2   1      3           1.0
2     9   5   9      1           0.8
3     5   7   5      9           0.9
4     8   3   8      5           0.7

I tried using %in% function a few times in this manner:

test <- out1[(out1$lon1 %in% out2$lon1) & (out1$lon2 %in% out2$lon2) & 
             (out1$lat1 %in% out2$lat1) & (out1$lat2 %in% out2$lat2), ]

This seems to work for the basic example I provide here. But, when I apply it to my huge data frame (with many lat and lons repeated) I get back a larger subset than the unqiue combinations I require. I assume because the match function in %in% can only match a vector. So it's matching condition1 & condition2 & condition3 & condition4 And thus is returning a results that gives a subset which is the same as the orginal out1. I want to get only the case when all four values are the same for that row. This way I'll get a subset of the data for the pairwise dissimilarities I'm interested in.

Any ideas on how to subset by row based on a unique combination of four variables would be greatly appreciated.

can you be more specific, I dont understand because all the 4 conditions are same in your case. Is test your expected output? — Nishanth
@e4e5f4, he wants the whole row to be matched. by doing individual comparisons and &ing them, you get all possible combinations, not necessarily the same identical elements in a row. It's not different from this question (to which we both answered) — Arun

Arun Arun · Accepted Answer · 2013-05-13T09:12:33

I think this is what you're looking for. Basically you want duplicated function that returns what you're expecting.

out1[duplicated(rbind(out2, out1)[, 1:4])[-seq_len(nrow(out2))], ]

How does it work? First we rbind out2 and out1. Then call duplicated on it. The columns that are in out2 and in out1 will be marked as TRUE in out1. This is because the first occurrence is on out2 and it was not duplicated there. But the second time it finds the entry, it will be in out1 and so it'll know there has been a row exactly like this before. So, it'll mark it as duplicated. We now have all duplicated entries. From this we subset only the elements of out1 by removing the first n elements where n = nrow(out1). Then we subset using this logical vector on out1.

You can go through this explanation and run the code step by step to follow-up. Here's a break-down version for working out the logic.

tt <- rbind(out2, out1)
tt.dup <- duplicated(tt[, 1:4)] # marks all duplicate rows in out1 from 1st 4 cols
tt.dup <- tt.dup[-seq_len(nrow(out2))] # remove all out2 entries (first n)
out1[tt.dup, ] # index only TRUE/duplicated elements from out1

Subset data frame based on unique combination of multiple conditions

1 Answers