Distinct rows based on date range

Question

I have a large-ish dataset with a structure something like the following:

structure(list(pathogen = c("MRSA", "L. pneumophila", "MRSA", 
"L. pneumophila", "MRSA", "MRSA", "MRSA", "L. pneumophila", "L. pneumophila", 
"MRSA"), variant = c("mecA", "sg1", "mecA", "sg1", "mecA", "mecC", 
"mecA", "sg1", "sg6", "mecA"), n = c(25L, 14L, 235L, 2L, 64L, 
15L, 13L, 6L, 11L, 8L), date = structure(c(15156, 15248, 15279, 
15279, 15309, 15340, 15340, 15400, 15431, 15461), class = "Date")), .Names = c("pathogen", 
"variant", "n", "date"), row.names = c(NA, -10L), class = "data.frame")

I want to find every row with a combination of variables that has not been recorded in the previous x-month period. So when I look for a combination of pathogen and variant that has not been recorded in the previous 3 months I go from:

         pathogen variant   n       date
1            MRSA    mecA  25 2011-07-01
2  L. pneumophila     sg1  14 2011-10-01
3            MRSA    mecA 235 2011-11-01
4  L. pneumophila     sg1   2 2011-11-01
5            MRSA    mecA  64 2011-12-01
6            MRSA    mecC  15 2012-01-01
7            MRSA    mecA  13 2012-01-01
8  L. pneumophila     sg1   6 2012-03-01
9  L. pneumophila     sg6  11 2012-04-01
10           MRSA    mecA   8 2012-05-01

to:

         pathogen variant   n       date
1            MRSA    mecA  25 2011-07-01
2  L. pneumophila     sg1  14 2011-10-01
3            MRSA    mecA 235 2011-11-01
6            MRSA    mecC  15 2012-01-01
8  L. pneumophila     sg1   6 2012-03-01
9  L. pneumophila     sg6  11 2012-04-01
10           MRSA    mecA   8 2012-05-01

All the solutions I've thought of so far involve writing loops. I'm also trying to use dplyr for analysis as much as possible, so my question is: is this possible in dplyr? And if not, what would an R-ish approach look like?

Would it meet your needs to divide your date range into x-month intervals and report those combinations that appear in each interval? — Sam Firke
I did consider doing something like that, using the cut function, but if a combination of variables is present three months running and one of the breaks falls within that period I'll get repeats that I don't want. — stuwest

talat · Accepted Answer · 2015-04-09T11:51:22

I'm not sure about the best way to handle exact month intervals, but to get you started, you could compute the difference in days between consecutive records of each combination and keep the rows where it exceeds 90 (this gives the same output as shown in the question):

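library(dplyr)

# assumes the rows are already sorted by date, as in the question's data
df %>%
  group_by(pathogen, variant) %>%
  filter(c(TRUE, diff(date) > 90))   # keep the first row of each group, then rows more than 90 days after the previous one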
#Source: local data frame [7 x 4]
#Groups: pathogen, variant
#
#        pathogen variant   n       date
#1           MRSA    mecA  25 2011-07-01
#2 L. pneumophila     sg1  14 2011-10-01
#3           MRSA    mecA 235 2011-11-01
#4           MRSA    mecC  15 2012-01-01
#5 L. pneumophila     sg1   6 2012-03-01
#6 L. pneumophila     sg6  11 2012-04-01
#7           MRSA    mecA   8 2012-05-01
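
For exact calendar months rather than a 90-day approximation, one possible sketch in base R is below. It is not part of the original answer. The helper names months_back and new_in_window are made up for illustration, and it assumes that a previous record exactly 3 months earlier still counts as inside the window (change < to <= for the other convention). It also assumes first-of-month dates like those in the question: shifting a day such as the 31st into a shorter month rolls over into the following month.

```r
# Example data from the question, rebuilt with plain date strings.
df <- data.frame(
  pathogen = c("MRSA", "L. pneumophila", "MRSA", "L. pneumophila", "MRSA",
               "MRSA", "MRSA", "L. pneumophila", "L. pneumophila", "MRSA"),
  variant = c("mecA", "sg1", "mecA", "sg1", "mecA",
              "mecC", "mecA", "sg1", "sg6", "mecA"),
  n = c(25L, 14L, 235L, 2L, 64L, 15L, 13L, 6L, 11L, 8L),
  date = as.Date(c("2011-07-01", "2011-10-01", "2011-11-01", "2011-11-01",
                   "2011-12-01", "2012-01-01", "2012-01-01", "2012-03-01",
                   "2012-04-01", "2012-05-01")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: shift a Date vector back by k calendar months.
# Same month arithmetic that seq(d, by = "-3 months") relies on.
months_back <- function(d, k) {
  lt <- as.POSIXlt(d)
  lt$mon <- lt$mon - k
  as.Date(lt)
}

# Hypothetical helper: for dates sorted ascending within one group, TRUE where
# there is no previous record or it falls before the start of the window.
new_in_window <- function(date, months = 3) {
  prev <- c(as.Date(NA), head(date, -1))
  is.na(prev) | prev < months_back(date, months)
}

# Sort by date, then apply the helper within each pathogen/variant combination.
df <- df[order(df$date), ]
key <- paste(df$pathogen, df$variant, sep = "\r")
keep <- as.logical(ave(seq_len(nrow(df)), key,
                       FUN = function(i) new_in_window(df$date[i])))
result <- df[keep, ]
print(result)
```

With dplyr loaded, the same helper should slot into the pipeline above as filter(new_in_window(date)) after group_by(pathogen, variant), again assuming the rows are sorted by date.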
