Pairwise date-time row comparison by group using combn and apply R

Question

I would like to do a pairwise comparison of every row by group for the difference in a date.time variable.

I have a data frame composed of a "site" variable, a dummy "species" variable, and a POSIXcT "date.time" variable, in ascending order. Each row has a different species, as I'm interested in the time difference between different species visiting a site. 3cols, 50rows added here

structure(list(site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("act_0041", 
"act_0048", "ACT0009", "ACT0035", "ACT0041"), class = "factor"), 
    species = c(12, 14, 28, 6, 34, 29, 27, 22, 35, 9, 16, 2, 
    32, 33, 6, 2, 29, 10, 34, 9, 22, 28, 32, 23, 33, 6, 10, 27, 
    12, 34, 32, 31, 10, 30, 6, 14, 35, 8, 23, 32, 12, 34, 22, 
    1, 13, 18, 6, 34, 27, 11), date.time = structure(c(1531454862, 
    1535035906, 1535348634, 1536254587, 1537580136, 1539047529, 
    1539335947, 1542708373, 1545597646, 1548570870, 1548862522, 
    1548970932, 1548970934, 1530624228, 1536088381, 1536270537, 
    1538374649, 1538705865, 1543254377, 1544755701, 1545263758, 
    1546425304, 1546490305, 1530393638, 1531013434, 1532049165, 
    1537459670, 1545803958, 1546142278, 1560118590, 1560203862, 
    1530431347, 1531031939, 1533129189, 1533975327, 1534157098, 
    1535229634, 1535594837, 1536352632, 1536355007, 1536397768, 
    1536707407, 1537231673, 1562873882, 1531454862, 1531595641, 
    1536254587, 1537732697, 1538760001, 1540317399), class = c("POSIXct", 
    "POSIXt"), tzone = "")), row.names = c(NA, -50L), class = c("tbl_df", 
"tbl", "data.frame"))

I want to compare all time differences between rows representing different species, within a group, and create a df or matrix where I have a group (e.g., "site"), each pairwise species combination, and the difference in time between them. E.g., for site a, I have the time difference between species a and species b, species b and c, species a and c etc. If I can avoid duplicating interactions (e.g. site a, species b and c, site a, species c and b) that would be great.

structure(list(site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "act_0041", class = "factor"), 
    species_interaction = structure(c(1L, 2L, 7L, 12L, 10L, 8L, 
    6L, 5L, 11L, 13L, 3L, 4L, 9L, 14L, 15L, 17L, 16L), .Label = c("12,12", 
    "12,14", "12,16", "12,2", "12,22", "12,27", "12,28", "12,29", 
    "12,32", "12,34", "12,35", "12,6", "12,9", "28,14", "28,28", 
    "28,34", "28,6"), class = "factor"), time.diff..mins. = c(NA, 
    350L, 6502L, 150L, 2065L, 52L, 630L, 542L, 2584L, 241L, 340L, 
    3689L, 201L, 31L, NA, 28L, 356L)), class = "data.frame", row.names =c(NA,-17L))

I've been trying with combn and apply, but I am only getting a matrix of empty integer and values 1,2,3 etc. I know I'm missing something to do with the date.time aspect and maybe displaying df.

df %>% group_by(site)
comb <- t(combn(nrow(as.data.frame(df$species)),2))
dx <- apply(comb, 1, function(x) df[x[1], -1] - df[x[2], -1])
dt <- cbind(comb, dx)

I have been trying to apply this example, though I think I'm missing a simple piece. Any help would be great

m.evans m.evans · Accepted Answer · 2020-04-27T21:41:06

One error in your code is that your call to the combn function should be the species column, not nrow(species column), which results in just an integer and is why you are only getting integers there.

However, I don't think just fixing that will give you the result you want, so here is an example using functions from the dplyr package.

First, I created some fake data that matched yours to use:

fake.df <- data.frame(site = rep(LETTERS[1:5], each = 20),
                      species = sample(1:10, 100, replace = T),
                      date.time = as.POSIXct(1531454861+seq(500,50000, by = 500), tz = "", origin  = "1960-01-01"))

For ease of reading, I'm just going to use a for-loop that iterates over each of your sites, identifies species interactions pairs within that site, and calculates the shortest time difference between visits. I specify it is the shortest because you mentioned that you didn't want duplicates, but this fake data will sometimes have species making multiple visits to the site, so we are taking the shortest.

#list object to save iteration by site into
new.dfs <- list()
for(i in 1:length(unique(fake.df$site))){
  this.site <- unique(fake.df$site)[i]
  this.visit <- fake.df[fake.df$site == this.site,]

  #get unique visits by species
  this.pairs <- unique(t(combn(this.visit$species,2)))

  #combine this with the time data for each of these
  this.pair.times <- data.frame(this.pairs) %>%
    rename(species_1 = X1, species_2 = X2) %>%
    #time of first visit
    left_join(this.visit, by = c("species_1" = "species")) %>%
    rename(time_1 = date.time) %>%
    #time of second visit
    left_join(this.visit, by = c("species_2" = "species")) %>%
    rename(time_2 = date.time) %>%
    #calculate time difference
    mutate(time_diff = abs(time_1-time_2)) %>%
    #create ID column of species pairs
    rowwise() %>%
    mutate(species_pair = paste(min(species_1, species_2), max(species_1, species_2), sep = "_")) %>%
    #drop duplicates, keep first visit only (smallest time difference)
    group_by(species_pair) %>%
    arrange(time_diff) %>%
    slice(1) %>%
    #select relevant columns
    select(site = site.x, species_pair, time_diff)

  new.dfs[[i]] <- this.pair.times
}

#combine into one large data.frame
result.df <- bind_rows(new.dfs)

This results in the following dataset:

site  species_pair time_diff
  <fct> <chr>        <drtn>   
1 A     1_1             0 secs
2 A     1_10         1000 secs
3 A     1_2          2000 secs
4 A     1_3           500 secs
5 A     1_4          2000 secs
6 A     1_5           500 secs

Edit[2020-4-30]: keep duplicates

If you want to keep the duplicates and also have the species_pair arranged in the order of arrival (e.g. if species 5 arrived before species 1 it would read 5_1):

#list object to save iteration by site into
new.dfs <- list()
for(i in 1:length(unique(fake.df$site))){
  this.site <- unique(fake.df$site)[i]
  this.visit <- fake.df[fake.df$site == this.site,]

  #get unique visits by species
  this.pairs <- unique(t(combn(this.visit$species,2)))

  #combine this with the time data for each of these
  this.pair.times <- data.frame(this.pairs) %>%
    rename(species_1 = X1, species_2 = X2) %>%
    #time of first visit
    left_join(this.visit, by = c("species_1" = "species")) %>%
    rename(time_1 = date.time) %>%
    #time of second visit
    left_join(this.visit, by = c("species_2" = "species")) %>%
    rename(time_2 = date.time) %>%
    #calculate time difference
    mutate(time_diff = abs(time_1-time_2)) %>%
    #create ID column of species pairs, this time sensitive to order of arrival
    rowwise() %>%
    mutate(species_pair = paste(species_1, species_2, sep = "_")) %>%
    #arrange in order of arrival
    arrange(time_1, time_2) %>%
    #select relevant columns
    select(site = site.x, species_pair, time_1, time_2, time_diff)

  new.dfs[[i]] <- this.pair.times
}

#combine into one large data.frame
result.df <- bind_rows(new.dfs)

This would then be ordered first by the time of visit of the first species in the pair (time_1) and then by the time of visit of the second species in the pair (time_2).

> result.df[100:115,]
Source: local data frame [16 x 5]
Groups: <by row>

# A tibble: 16 x 5
   site  species_pair time_1              time_2              time_diff
   <fct> <chr>        <dttm>              <dttm>              <drtn>   
 1 A     8_7          2008-07-12 00:57:41 2008-07-12 01:39:21 2500 secs
 2 A     8_3          2008-07-12 00:57:41 2008-07-12 01:47:41 3000 secs
 3 A     8_8          2008-07-12 00:57:41 2008-07-12 01:56:01 3500 secs
 4 A     8_4          2008-07-12 00:57:41 2008-07-12 02:04:21 4000 secs
 5 A     8_7          2008-07-12 00:57:41 2008-07-12 02:12:41 4500 secs
 6 A     8_8          2008-07-12 00:57:41 2008-07-12 02:21:01 5000 secs
 7 A     8_9          2008-07-12 00:57:41 2008-07-12 02:29:21 5500 secs
 8 A     8_8          2008-07-12 00:57:41 2008-07-12 02:37:41 6000 secs
 9 A     8_8          2008-07-12 00:57:41 2008-07-12 02:46:01 6500 secs
10 A     8_2          2008-07-12 00:57:41 2008-07-12 02:54:21 7000 secs
11 A     6_3          2008-07-12 01:06:01 2008-07-12 00:49:21 1000 secs
12 A     6_8          2008-07-12 01:06:01 2008-07-12 00:57:41  500 secs
13 A     6_6          2008-07-12 01:06:01 2008-07-12 01:06:01    0 secs
14 A     6_4          2008-07-12 01:06:01 2008-07-12 01:14:21  500 secs
15 A     6_6          2008-07-12 01:06:01 2008-07-12 01:22:41 1000 secs
16 A     6_9          2008-07-12 01:06:01 2008-07-12 01:31:01 1500 secs

You may notice that sometimes time_2 is less than time_1 because we haven't filtered for that. If you only want visits after each species 1, then you can filter for it using filter(result.df, time_2>time_1)

Pairwise date-time row comparison by group using combn and apply R

1 Answers

Edit[2020-4-30]: keep duplicates