0
votes

I have a dataset that contains the residence period (start.date to end.date) of marked individuals (ID) at different sites. My goal is to generate a column that tells me the average number of other individuals per day that were also present at the same site (across the total residence period of each individual).

To do this, I need to determine the total number of individuals that were present per site on each date, summed across the total residence period of each individual. Ultimately, I will divide this sum by the total residence days of each individual to calculate the average. Can anyone help me accomplish this?

I calculated the total number of residence days (total.days) using lubridate and dplyr

mutate(total.days = end.date - start.date + 1)

    site    ID  start.date  end.date  total.days
1     1   16      5/24/17     6/5/17    13
2     1   46      4/30/17     5/20/17   21  
3     1   26      4/30/17     5/23/17   24
4     1   89      5/5/17      5/13/17   9
5     1   12      5/11/17     5/14/17   4
6     2   14      5/4/17      5/10/17   7
7     2   18      5/9/17      5/29/17   21
8     2   19      5/24/17     6/10/17   18
9     2   39      5/5/17      5/18/17   14
1
So what is your expected output?Maurits Evers
If I were to do it manually for the data I provided above, ID #16 was present for 13 days, from 5/24 to 6/5. On 5/24, 0 other individuals were also present at site 1. On 5/25, 0 individuals were also present...until the end date, 6/5, also 0. So the output for ID #16 would be 0 + 0... + 0 / 13 = 0. For ID #46, the start date was 4/30. On 4/30, there was 1 other individual present (for 2 individuals total, ID #16 & 26). On 5/1, there were also 2 individuals present...etc. Take the sum of all of these days and then divide by the total residency days (21) for the average. Does that make sense?antares

1 Answers

0
votes

First of all, it is always advisable to give a sample of the data in a more friendly format using dput(yourData) so that other can easily regenerate your data. Here is the output of dput() you could better be sharing:

> dput(dat)
structure(list(site = c(1, 1, 1, 1, 1, 2, 2, 2, 2), ID = c(16, 
46, 26, 89, 12, 14, 18, 19, 39), start.date = structure(c(17310, 
17286, 17286, 17291, 17297, 17290, 17295, 17310, 17291), class = "Date"), 
end.date = structure(c(17322, 17306, 17309, 17299, 17300, 
17296, 17315, 17327, 17304), class = "Date")), class = "data.frame", row.names = 
c(NA, 
-9L))

To do this easily we first need to unpack the start.date and end.date to individual dates:

newDat <- data.frame()
for (i in 1:nrow(dat)){
  expand  <-  data.frame(site = dat$site[i],
                         ID = dat$ID[i],
                         Dates = seq.Date(dat$start.date[i], dat$end.date[i], 1))
  newDat <- rbind(newDat, expand)
}

newDat
    site ID      Dates
1      1 16 2017-05-24
2      1 16 2017-05-25
3      1 16 2017-05-26
4      1 16 2017-05-27
5      1 16 2017-05-28
6      1 16 2017-05-29
7      1 16 2017-05-30
. . . 
. . .

Then we calculate the number of other individuals present in each site in each day:

individualCount = newDat %>%
                     group_by(site, Dates) %>%
                     summarise(individuals = n_distinct(ID) - 1)
individualCount
# A tibble: 75 x 3
# Groups:   site [?]
    site Dates      individuals
   <dbl> <date>           <int>
 1     1 2017-04-30           1
 2     1 2017-05-01           1
 3     1 2017-05-02           1
 4     1 2017-05-03           1
 5     1 2017-05-04           1
 6     1 2017-05-05           2
 7     1 2017-05-06           2
 8     1 2017-05-07           2
 9     1 2017-05-08           2
 10    1 2017-05-09           2
 # ... with 65 more rows

Then, we augment our data with the new information using left_join() and calculate the required average:

newDat <- left_join(newDat, individualCount, by = c("site", "Dates")) %>%
            group_by(site, ID) %>%
            summarise(duration = max(Dates) - min(Dates)+1,
                      av.individuals = mean(individuals))
newDat
# A tibble: 9 x 4
# Groups:   site [?]
   site    ID duration av.individuals
  <dbl> <dbl> <time>            <dbl>
1     1    12 4                  0.75
2     1    16 13                 0   
3     1    26 24                 1.42
4     1    46 21                 1.62
5     1    89 9                  1.33
6     2    14 7                  1.14
7     2    18 21                 0.875
8     2    19 18                 0.333
9     2    39 14                 1.14

The final step is to add the required column to the original dataset (dat) again with left_join():

dat %>% left_join(newDat, by = c("site", "ID"))
dat
  site ID start.date   end.date   duration av.individuals
1    1 16 2017-05-24 2017-06-05    13 days       0.000000
2    1 46 2017-04-30 2017-05-20    21 days       1.619048
3    1 26 2017-04-30 2017-05-23    24 days       1.416667
4    1 89 2017-05-05 2017-05-13     9 days       2.333333
5    1 12 2017-05-11 2017-05-14     4 days       2.750000
6    2 14 2017-05-04 2017-05-10     7 days       1.142857
7    2 18 2017-05-09 2017-05-29    21 days       0.857143
8    2 19 2017-05-24 2017-06-10    18 days       0.333333
9    2 39 2017-05-05 2017-05-18    14 days       1.142857