2
votes

There are many questions (and answers!) related to group_by and mutate such as (1,2,3), but I cannot find a solution to my problem. So hopefully it is not already mentioned somewhere, I am new to the community but I really looked hard before posting this.

I have a data frame similar to this:

Place=c(rep("Belgium",14),rep("France",9),rep("Spain",21))
Age=c(rep("Newborn",4),rep("Young",3),rep("Adult",7),rep("Newborn",2),rep("Young",3),rep("Adult",4),rep("Newborn",7),rep("Young",11),rep("Adult",3))
Date_sample=as.Date(c('2014-03-18','2014-04-01','2015-06-16','2015-07-21','2014-07-15','2014-08-26','2015-04-07','2014-06-12','2014-08-11','2014-12-08','2015-06-02','2015-07-03','2016-03-24','2016-06-22',
          '2014-04-18','2014-05-01','2014-06-18','2014-07-25','2015-10-15','2014-07-26','2014-09-07','2015-06-03','2015-08-19','2014-03-11','2014-04-01','2014-04-15','2015-06-16','2015-07-15','2016-06-23','2016-10-01',
          '2014-03-15','2014-04-12','2014-04-30','2015-04-13','2015-07-01','2015-09-01','2015-10-27','2016-05-03','2016-06-26','2016-07-21','2016-08-25',
          '2014-04-15','2014-04-30','2015-04-12'))
Sampling=c("Initial","S1","Initial","S1","Initial","S1","Initial","Initial","S1","S2","Initial","S1","Initial","S1","Initial","S1","Initial","S1","Initial","Initial","S1","Initial","S1","Initial","S1","S2","Initial",
           "S1","Initial","S1","Initial","S1","S2","Initial","S1","S2","S3","Initial","S1","S2","S3","Initial","S1","Initial")
total_1=c(300,289,299,299,200,197,200,100,100,75,94,94,100,100,296,296,200,202,200,100,100,100,98,300,300,300,300,300,299,290,200,200,199,199,210,210,200,200,199,196,196,100,100,100)
my_data<-data.table(Place,Age,Date_sample,Sampling,total_1)

So my_data looks like this:

> my_data
      Place     Age Date_sample Sampling total_1
 1: Belgium Newborn  2014-03-18  Initial     300
 2: Belgium Newborn  2014-04-01       S1     289
 3: Belgium Newborn  2015-06-16  Initial     299
 4: Belgium Newborn  2015-07-21       S1     299
 5: Belgium   Young  2014-07-15  Initial     200
 6: Belgium   Young  2014-08-26       S1     197
 7: Belgium   Young  2015-04-07  Initial     200
 8: Belgium   Adult  2014-06-12  Initial     100
 9: Belgium   Adult  2014-08-11       S1     100
10: Belgium   Adult  2014-12-08       S2      75
11: Belgium   Adult  2015-06-02  Initial      94
12: Belgium   Adult  2015-07-03       S1      94
13: Belgium   Adult  2016-03-24  Initial     100
14: Belgium   Adult  2016-06-22       S1     100
15:  France Newborn  2014-04-18  Initial     296
16:  France Newborn  2014-05-01       S1     296
17:  France   Young  2014-06-18  Initial     200
18:  France   Young  2014-07-25       S1     202
19:  France   Young  2015-10-15  Initial     200
20:  France   Adult  2014-07-26  Initial     100
21:  France   Adult  2014-09-07       S1     100
22:  France   Adult  2015-06-03  Initial     100
23:  France   Adult  2015-08-19       S1      98
24:   Spain Newborn  2014-03-11  Initial     300
25:   Spain Newborn  2014-04-01       S1     300
26:   Spain Newborn  2014-04-15       S2     300
27:   Spain Newborn  2015-06-16  Initial     300
28:   Spain Newborn  2015-07-15       S1     300
29:   Spain Newborn  2016-06-23  Initial     299
30:   Spain Newborn  2016-10-01       S1     290
31:   Spain   Young  2014-03-15  Initial     200
32:   Spain   Young  2014-04-12       S1     200
33:   Spain   Young  2014-04-30       S2     199
34:   Spain   Young  2015-04-13  Initial     199
35:   Spain   Young  2015-07-01       S1     210
36:   Spain   Young  2015-09-01       S2     210
37:   Spain   Young  2015-10-27       S3     200
38:   Spain   Young  2016-05-03  Initial     200
39:   Spain   Young  2016-06-26       S1     199
40:   Spain   Young  2016-07-21       S2     196
41:   Spain   Young  2016-08-25       S3     196
42:   Spain   Adult  2014-04-15  Initial     100
43:   Spain   Adult  2014-04-30       S1     100
44:   Spain   Adult  2015-04-12  Initial     100

I want to find the abnormal values: for each Place, Age class, and year, the total number of individual must remain equal over time. I want to use the dplyr syntax. So I tried using group_by and then I want to identify for each group the total_1 value of the initial sampling, and compare it to the other total_1 value of the group, and return NA if the values are identical, or "Yes" if the value is different (i.e. abnormal).

I have tried the following:

my_data <- my_data %>% mutate(Year=lubridate::year(Date_sample)) %>% relocate(Year, .after=Date_sample)

my_data <- my_data %>% group_by(Place, Age, Year) %>% mutate(
            abnorm_1=ifelse(total_1==.[which("Sampling"=="Initial"),"total_1"],NA_integer_,'Yes'))

Isolating the year works, but the second "mutate" does not. I feel I am not identyfying the value well for the condition. I also tried abnorm_1=ifelse(total_1==total_1[1],NA_integer_,'Yes' but did not get a better results. The error is:

Error: Problem with `mutate()` column `abnorm_1`.
i `abnorm_1 = ifelse(...)`.
x subscript out of bounds
i The error occurred in group 1: Place = "Belgium", Age = "Adult", Year = 2014.
Run `rlang::last_error()` to see where the error occurred. 

The expected results is:

> my_data
      Place     Age Date_sample Sampling total_1 abnorm_1
 1: Belgium Newborn  2014-03-18  Initial     300       NA
 2: Belgium Newborn  2014-04-01       S1     289      Yes
 3: Belgium Newborn  2015-06-16  Initial     299       NA
 4: Belgium Newborn  2015-07-21       S1     299       NA
 5: Belgium   Young  2014-07-15  Initial     200       NA
 6: Belgium   Young  2014-08-26       S1     197      Yes
 7: Belgium   Young  2015-04-07  Initial     200       NA
 8: Belgium   Adult  2014-06-12  Initial     100       NA
 9: Belgium   Adult  2014-08-11       S1     100       NA
10: Belgium   Adult  2014-12-08       S2      75      Yes
11: Belgium   Adult  2015-06-02  Initial      94       NA
12: Belgium   Adult  2015-07-03       S1      94       NA
13: Belgium   Adult  2016-03-24  Initial     100       NA
14: Belgium   Adult  2016-06-22       S1     100       NA
15:  France Newborn  2014-04-18  Initial     296       NA
16:  France Newborn  2014-05-01       S1     296       NA
17:  France   Young  2014-06-18  Initial     200       NA
18:  France   Young  2014-07-25       S1     202      Yes
19:  France   Young  2015-10-15  Initial     200       NA
20:  France   Adult  2014-07-26  Initial     100       NA
21:  France   Adult  2014-09-07       S1     100       NA
22:  France   Adult  2015-06-03  Initial     100       NA
23:  France   Adult  2015-08-19       S1      98      Yes
24:   Spain Newborn  2014-03-11  Initial     300       NA
25:   Spain Newborn  2014-04-01       S1     300       NA
26:   Spain Newborn  2014-04-15       S2     300       NA
27:   Spain Newborn  2015-06-16  Initial     300       NA
28:   Spain Newborn  2015-07-15       S1     300       NA
29:   Spain Newborn  2016-06-23  Initial     299       NA
30:   Spain Newborn  2016-10-01       S1     290      Yes
31:   Spain   Young  2014-03-15  Initial     200       NA
32:   Spain   Young  2014-04-12       S1     200       NA
33:   Spain   Young  2014-04-30       S2     199      Yes
34:   Spain   Young  2015-04-13  Initial     199       NA
35:   Spain   Young  2015-07-01       S1     210      Yes
36:   Spain   Young  2015-09-01       S2     210      Yes
37:   Spain   Young  2015-10-27       S3     200      Yes
38:   Spain   Young  2016-05-03  Initial     200       NA
39:   Spain   Young  2016-06-26       S1     199      Yes
40:   Spain   Young  2016-07-21       S2     196      Yes
41:   Spain   Young  2016-08-25       S3     196      Yes
42:   Spain   Adult  2014-04-15  Initial     100       NA
43:   Spain   Adult  2014-04-30       S1     100       NA
44:   Spain   Adult  2015-04-12  Initial     100       NA

Can someone please help me fix this? Thank you for your help!

Edit following comments As mentionned in the comment sections, fixing the NA character/integer point made me observe something. One group of individuals can be followed over two years, so grouping by year induces incorrect information. Some of my lines look like this:

Place=rep("Spain",21)
Age=c(rep("Newborn",7),rep("Young",11),rep("Adult",3))
Date_sample=as.Date(c('2014-03-11','2014-04-01','2015-04-15','2015-06-16','2015-07-15','2016-06-23','2016-10-01','2014-03-15','2014-04-12',
                      '2015-01-30','2015-04-13','2015-07-01','2016-02-01','2016-02-20','2016-05-03','2016-06-26','2017-01-21','2017-03-01',
                      '2014-04-15','2014-04-30','2015-04-12'))
Sampling=c("Initial","S1","S2","Initial","S1","Initial","S1","Initial","S1","S2","Initial","S1","S2","S3","Initial","S1","S2","S3","Initial",
           "S1","Initial")
total_1=c(300,300,297,300,300,299,290,200,200,199,199,210,210,200,200,199,196,196,100,100,100)
my_data<-data.table(Place,Age,Date_sample,Sampling,total_1)
my_data
    Place     Age Date_sample Sampling total_1
 1: Spain Newborn  2014-03-11  Initial     300
 2: Spain Newborn  2014-04-01       S1     300
 3: Spain Newborn  2015-04-15       S2     297
 4: Spain Newborn  2015-06-16  Initial     300
 5: Spain Newborn  2015-07-15       S1     300
 6: Spain Newborn  2016-06-23  Initial     299
 7: Spain Newborn  2016-10-01       S1     290
 8: Spain   Young  2014-03-15  Initial     200
 9: Spain   Young  2014-04-12       S1     200
10: Spain   Young  2015-01-30       S2     199
11: Spain   Young  2015-04-13  Initial     199
12: Spain   Young  2015-07-01       S1     210
13: Spain   Young  2016-02-01       S2     210
14: Spain   Young  2016-02-20       S3     200
15: Spain   Young  2016-05-03  Initial     200
16: Spain   Young  2016-06-26       S1     199
17: Spain   Young  2017-01-21       S2     196
18: Spain   Young  2017-03-01       S3     196
19: Spain   Adult  2014-04-15  Initial     100
20: Spain   Adult  2014-04-30       S1     100
21: Spain   Adult  2015-04-12  Initial     100

If I could generate a new column to identify each group/experimental unit, I could then group_by this id and run the rest of the code. Is there a way, ideally not using a loop, to i) create a new column, ii) when sampling == initial, write 1 for that line in the new column and for each line until the next row with "initial" - 1. Then write 2 etc. The solution should look like this:

my_data
    Place     Age Date_sample Sampling exp_unit total_1 abnorm_1
 1: Spain Newborn  2014-03-11  Initial        1     300       NA
 2: Spain Newborn  2014-04-01       S1        1     300       NA
 3: Spain Newborn  2015-04-15       S2        1     297      Yes
 4: Spain Newborn  2015-06-16  Initial        2     300       NA
 5: Spain Newborn  2015-07-15       S1        2     300       NA
 6: Spain Newborn  2016-06-23  Initial        3     299       NA
 7: Spain Newborn  2016-10-01       S1        3     290      Yes
 8: Spain   Young  2014-03-15  Initial        4     200       NA
 9: Spain   Young  2014-04-12       S1        4     200       NA
10: Spain   Young  2015-01-30       S2        4     199      Yes
11: Spain   Young  2015-04-13  Initial        5     199       NA
12: Spain   Young  2015-07-01       S1        5     210      Yes
13: Spain   Young  2016-02-01       S2        5     210      Yes
14: Spain   Young  2016-02-20       S3        5     200      Yes
15: Spain   Young  2016-05-03  Initial        6     200       NA
16: Spain   Young  2016-06-26       S1        6     199      Yes
17: Spain   Young  2017-01-21       S2        6     196      Yes
18: Spain   Young  2017-03-01       S3        6     196      Yes
19: Spain   Adult  2014-04-15  Initial        7     100       NA
20: Spain   Adult  2014-04-30       S1        7     100       NA
21: Spain   Adult  2015-04-12  Initial        8     100       NA

Thank you for your help!

1
Your second code attempt, abnorm_1=ifelse(total_1 == total_1[1], NA_character_, 'Yes'), worked for me (although note I had to use a character NA instead of an integer NA).aosmith
Or you could use total_1 == first(total_1).Martin Gal
Thank you, NA_character_ indeed works! Any idea what makes the difference between a character and an integer NA here? Also, now that the second attempt works, I realize (looking at the results) that I have one more issue: one sampling unit, for example of class Young, can start for example in March 2014, and be followed-up until Feb 2015. Then, another unit of class Young will be put in place in April 2015 and last until December 2015. I feel the first attempt could help fix this by using the "initial" status. Any idea what doesn't work there? Thank you for your help!Mata
An integer is like a numeric variable and a character is a categorical variable (usually a string, surrounded by quotes). The tidyverse is pretty strict with NA types, so since you return a character ("Yes") you also needed NA_character_ so all output types match. FYI to make a comment back to folks who commented so they'll see it, use @ and their user name in the comment. :)aosmith

1 Answers

0
votes

It sounds like for each Place and Age you need to create groups every time Sampling is "Initial" again.

You can make groups like this with cumsum(). Note this is all contingent on the dataset being in the correct order.

First an example of making a new grouping variable with your Spain dataset.

# Making groups based on "Initial"
my_data %>% 
    group_by(Place, Age) %>% 
    mutate(group = cumsum(Sampling == "Initial"))
#> # A tibble: 21 x 6
#> # Groups:   Place, Age [3]
#>    Place Age     Date_sample Sampling total_1 group
#>    <chr> <chr>   <chr>       <chr>      <dbl> <int>
#>  1 Spain Newborn 2014-03-11  Initial      300     1
#>  2 Spain Newborn 2014-04-01  S1           300     1
#>  3 Spain Newborn 2015-04-15  S2           297     1
#>  4 Spain Newborn 2015-06-16  Initial      300     2
#>  5 Spain Newborn 2015-07-15  S1           300     2
#>  6 Spain Newborn 2016-06-23  Initial      299     3
#>  7 Spain Newborn 2016-10-01  S1           290     3
#>  8 Spain Young   2014-03-15  Initial      200     1
#>  9 Spain Young   2014-04-12  S1           200     1
#> 10 Spain Young   2015-01-30  S2           199     1
#> # ... with 11 more rows

Now you can group by this variable and do the replacement the way you worked out already. Remove the group variable after ungrouping, as needed.

my_data %>% 
    group_by(Place, Age) %>% 
    mutate(group = cumsum(Sampling == "Initial")) %>%
    group_by(group, .add = TRUE) %>%
    mutate(
        abnorm_1 = ifelse(total_1 == total_1[1], NA_character_, 'Yes')) %>%
    ungroup()
#> # A tibble: 21 x 7
#>    Place Age     Date_sample Sampling total_1 group abnorm_1
#>    <chr> <chr>   <chr>       <chr>      <dbl> <int> <chr>   
#>  1 Spain Newborn 2014-03-11  Initial      300     1 NA      
#>  2 Spain Newborn 2014-04-01  S1           300     1 NA      
#>  3 Spain Newborn 2015-04-15  S2           297     1 Yes     
#>  4 Spain Newborn 2015-06-16  Initial      300     2 NA      
#>  5 Spain Newborn 2015-07-15  S1           300     2 NA      
#>  6 Spain Newborn 2016-06-23  Initial      299     3 NA      
#>  7 Spain Newborn 2016-10-01  S1           290     3 Yes     
#>  8 Spain Young   2014-03-15  Initial      200     1 NA      
#>  9 Spain Young   2014-04-12  S1           200     1 NA      
#> 10 Spain Young   2015-01-30  S2           199     1 Yes     
#> # ... with 11 more rows

Created on 2021-08-05 by the reprex package (v2.0.0)