There are many questions (and answers!) related to group_by and mutate such as (1,2,3), but I cannot find a solution to my problem. So hopefully it is not already mentioned somewhere, I am new to the community but I really looked hard before posting this.
I have a data frame similar to this:
Place=c(rep("Belgium",14),rep("France",9),rep("Spain",21))
Age=c(rep("Newborn",4),rep("Young",3),rep("Adult",7),rep("Newborn",2),rep("Young",3),rep("Adult",4),rep("Newborn",7),rep("Young",11),rep("Adult",3))
Date_sample=as.Date(c('2014-03-18','2014-04-01','2015-06-16','2015-07-21','2014-07-15','2014-08-26','2015-04-07','2014-06-12','2014-08-11','2014-12-08','2015-06-02','2015-07-03','2016-03-24','2016-06-22',
'2014-04-18','2014-05-01','2014-06-18','2014-07-25','2015-10-15','2014-07-26','2014-09-07','2015-06-03','2015-08-19','2014-03-11','2014-04-01','2014-04-15','2015-06-16','2015-07-15','2016-06-23','2016-10-01',
'2014-03-15','2014-04-12','2014-04-30','2015-04-13','2015-07-01','2015-09-01','2015-10-27','2016-05-03','2016-06-26','2016-07-21','2016-08-25',
'2014-04-15','2014-04-30','2015-04-12'))
Sampling=c("Initial","S1","Initial","S1","Initial","S1","Initial","Initial","S1","S2","Initial","S1","Initial","S1","Initial","S1","Initial","S1","Initial","Initial","S1","Initial","S1","Initial","S1","S2","Initial",
"S1","Initial","S1","Initial","S1","S2","Initial","S1","S2","S3","Initial","S1","S2","S3","Initial","S1","Initial")
total_1=c(300,289,299,299,200,197,200,100,100,75,94,94,100,100,296,296,200,202,200,100,100,100,98,300,300,300,300,300,299,290,200,200,199,199,210,210,200,200,199,196,196,100,100,100)
my_data<-data.table(Place,Age,Date_sample,Sampling,total_1)
So my_data looks like this:
> my_data
Place Age Date_sample Sampling total_1
1: Belgium Newborn 2014-03-18 Initial 300
2: Belgium Newborn 2014-04-01 S1 289
3: Belgium Newborn 2015-06-16 Initial 299
4: Belgium Newborn 2015-07-21 S1 299
5: Belgium Young 2014-07-15 Initial 200
6: Belgium Young 2014-08-26 S1 197
7: Belgium Young 2015-04-07 Initial 200
8: Belgium Adult 2014-06-12 Initial 100
9: Belgium Adult 2014-08-11 S1 100
10: Belgium Adult 2014-12-08 S2 75
11: Belgium Adult 2015-06-02 Initial 94
12: Belgium Adult 2015-07-03 S1 94
13: Belgium Adult 2016-03-24 Initial 100
14: Belgium Adult 2016-06-22 S1 100
15: France Newborn 2014-04-18 Initial 296
16: France Newborn 2014-05-01 S1 296
17: France Young 2014-06-18 Initial 200
18: France Young 2014-07-25 S1 202
19: France Young 2015-10-15 Initial 200
20: France Adult 2014-07-26 Initial 100
21: France Adult 2014-09-07 S1 100
22: France Adult 2015-06-03 Initial 100
23: France Adult 2015-08-19 S1 98
24: Spain Newborn 2014-03-11 Initial 300
25: Spain Newborn 2014-04-01 S1 300
26: Spain Newborn 2014-04-15 S2 300
27: Spain Newborn 2015-06-16 Initial 300
28: Spain Newborn 2015-07-15 S1 300
29: Spain Newborn 2016-06-23 Initial 299
30: Spain Newborn 2016-10-01 S1 290
31: Spain Young 2014-03-15 Initial 200
32: Spain Young 2014-04-12 S1 200
33: Spain Young 2014-04-30 S2 199
34: Spain Young 2015-04-13 Initial 199
35: Spain Young 2015-07-01 S1 210
36: Spain Young 2015-09-01 S2 210
37: Spain Young 2015-10-27 S3 200
38: Spain Young 2016-05-03 Initial 200
39: Spain Young 2016-06-26 S1 199
40: Spain Young 2016-07-21 S2 196
41: Spain Young 2016-08-25 S3 196
42: Spain Adult 2014-04-15 Initial 100
43: Spain Adult 2014-04-30 S1 100
44: Spain Adult 2015-04-12 Initial 100
I want to find the abnormal values: for each Place, Age class, and year, the total number of individual must remain equal over time. I want to use the dplyr syntax. So I tried using group_by and then I want to identify for each group the total_1 value of the initial sampling, and compare it to the other total_1 value of the group, and return NA if the values are identical, or "Yes" if the value is different (i.e. abnormal).
I have tried the following:
my_data <- my_data %>% mutate(Year=lubridate::year(Date_sample)) %>% relocate(Year, .after=Date_sample)
my_data <- my_data %>% group_by(Place, Age, Year) %>% mutate(
abnorm_1=ifelse(total_1==.[which("Sampling"=="Initial"),"total_1"],NA_integer_,'Yes'))
Isolating the year works, but the second "mutate" does not. I feel I am not identyfying the value well for the condition. I also tried abnorm_1=ifelse(total_1==total_1[1],NA_integer_,'Yes'
but did not get a better results. The error is:
Error: Problem with `mutate()` column `abnorm_1`.
i `abnorm_1 = ifelse(...)`.
x subscript out of bounds
i The error occurred in group 1: Place = "Belgium", Age = "Adult", Year = 2014.
Run `rlang::last_error()` to see where the error occurred.
The expected results is:
> my_data
Place Age Date_sample Sampling total_1 abnorm_1
1: Belgium Newborn 2014-03-18 Initial 300 NA
2: Belgium Newborn 2014-04-01 S1 289 Yes
3: Belgium Newborn 2015-06-16 Initial 299 NA
4: Belgium Newborn 2015-07-21 S1 299 NA
5: Belgium Young 2014-07-15 Initial 200 NA
6: Belgium Young 2014-08-26 S1 197 Yes
7: Belgium Young 2015-04-07 Initial 200 NA
8: Belgium Adult 2014-06-12 Initial 100 NA
9: Belgium Adult 2014-08-11 S1 100 NA
10: Belgium Adult 2014-12-08 S2 75 Yes
11: Belgium Adult 2015-06-02 Initial 94 NA
12: Belgium Adult 2015-07-03 S1 94 NA
13: Belgium Adult 2016-03-24 Initial 100 NA
14: Belgium Adult 2016-06-22 S1 100 NA
15: France Newborn 2014-04-18 Initial 296 NA
16: France Newborn 2014-05-01 S1 296 NA
17: France Young 2014-06-18 Initial 200 NA
18: France Young 2014-07-25 S1 202 Yes
19: France Young 2015-10-15 Initial 200 NA
20: France Adult 2014-07-26 Initial 100 NA
21: France Adult 2014-09-07 S1 100 NA
22: France Adult 2015-06-03 Initial 100 NA
23: France Adult 2015-08-19 S1 98 Yes
24: Spain Newborn 2014-03-11 Initial 300 NA
25: Spain Newborn 2014-04-01 S1 300 NA
26: Spain Newborn 2014-04-15 S2 300 NA
27: Spain Newborn 2015-06-16 Initial 300 NA
28: Spain Newborn 2015-07-15 S1 300 NA
29: Spain Newborn 2016-06-23 Initial 299 NA
30: Spain Newborn 2016-10-01 S1 290 Yes
31: Spain Young 2014-03-15 Initial 200 NA
32: Spain Young 2014-04-12 S1 200 NA
33: Spain Young 2014-04-30 S2 199 Yes
34: Spain Young 2015-04-13 Initial 199 NA
35: Spain Young 2015-07-01 S1 210 Yes
36: Spain Young 2015-09-01 S2 210 Yes
37: Spain Young 2015-10-27 S3 200 Yes
38: Spain Young 2016-05-03 Initial 200 NA
39: Spain Young 2016-06-26 S1 199 Yes
40: Spain Young 2016-07-21 S2 196 Yes
41: Spain Young 2016-08-25 S3 196 Yes
42: Spain Adult 2014-04-15 Initial 100 NA
43: Spain Adult 2014-04-30 S1 100 NA
44: Spain Adult 2015-04-12 Initial 100 NA
Can someone please help me fix this? Thank you for your help!
Edit following comments
As mentionned in the comment sections, fixing the NA
character/integer point made me observe something. One group of individuals can be followed over two years, so grouping by year induces incorrect information. Some of my lines look like this:
Place=rep("Spain",21)
Age=c(rep("Newborn",7),rep("Young",11),rep("Adult",3))
Date_sample=as.Date(c('2014-03-11','2014-04-01','2015-04-15','2015-06-16','2015-07-15','2016-06-23','2016-10-01','2014-03-15','2014-04-12',
'2015-01-30','2015-04-13','2015-07-01','2016-02-01','2016-02-20','2016-05-03','2016-06-26','2017-01-21','2017-03-01',
'2014-04-15','2014-04-30','2015-04-12'))
Sampling=c("Initial","S1","S2","Initial","S1","Initial","S1","Initial","S1","S2","Initial","S1","S2","S3","Initial","S1","S2","S3","Initial",
"S1","Initial")
total_1=c(300,300,297,300,300,299,290,200,200,199,199,210,210,200,200,199,196,196,100,100,100)
my_data<-data.table(Place,Age,Date_sample,Sampling,total_1)
my_data
Place Age Date_sample Sampling total_1
1: Spain Newborn 2014-03-11 Initial 300
2: Spain Newborn 2014-04-01 S1 300
3: Spain Newborn 2015-04-15 S2 297
4: Spain Newborn 2015-06-16 Initial 300
5: Spain Newborn 2015-07-15 S1 300
6: Spain Newborn 2016-06-23 Initial 299
7: Spain Newborn 2016-10-01 S1 290
8: Spain Young 2014-03-15 Initial 200
9: Spain Young 2014-04-12 S1 200
10: Spain Young 2015-01-30 S2 199
11: Spain Young 2015-04-13 Initial 199
12: Spain Young 2015-07-01 S1 210
13: Spain Young 2016-02-01 S2 210
14: Spain Young 2016-02-20 S3 200
15: Spain Young 2016-05-03 Initial 200
16: Spain Young 2016-06-26 S1 199
17: Spain Young 2017-01-21 S2 196
18: Spain Young 2017-03-01 S3 196
19: Spain Adult 2014-04-15 Initial 100
20: Spain Adult 2014-04-30 S1 100
21: Spain Adult 2015-04-12 Initial 100
If I could generate a new column to identify each group/experimental unit, I could then group_by this id and run the rest of the code. Is there a way, ideally not using a loop, to i) create a new column, ii) when sampling == initial, write 1 for that line in the new column and for each line until the next row with "initial" - 1. Then write 2 etc. The solution should look like this:
my_data
Place Age Date_sample Sampling exp_unit total_1 abnorm_1
1: Spain Newborn 2014-03-11 Initial 1 300 NA
2: Spain Newborn 2014-04-01 S1 1 300 NA
3: Spain Newborn 2015-04-15 S2 1 297 Yes
4: Spain Newborn 2015-06-16 Initial 2 300 NA
5: Spain Newborn 2015-07-15 S1 2 300 NA
6: Spain Newborn 2016-06-23 Initial 3 299 NA
7: Spain Newborn 2016-10-01 S1 3 290 Yes
8: Spain Young 2014-03-15 Initial 4 200 NA
9: Spain Young 2014-04-12 S1 4 200 NA
10: Spain Young 2015-01-30 S2 4 199 Yes
11: Spain Young 2015-04-13 Initial 5 199 NA
12: Spain Young 2015-07-01 S1 5 210 Yes
13: Spain Young 2016-02-01 S2 5 210 Yes
14: Spain Young 2016-02-20 S3 5 200 Yes
15: Spain Young 2016-05-03 Initial 6 200 NA
16: Spain Young 2016-06-26 S1 6 199 Yes
17: Spain Young 2017-01-21 S2 6 196 Yes
18: Spain Young 2017-03-01 S3 6 196 Yes
19: Spain Adult 2014-04-15 Initial 7 100 NA
20: Spain Adult 2014-04-30 S1 7 100 NA
21: Spain Adult 2015-04-12 Initial 8 100 NA
Thank you for your help!
abnorm_1=ifelse(total_1 == total_1[1], NA_character_, 'Yes')
, worked for me (although note I had to use a characterNA
instead of an integerNA
). – aosmithtotal_1 == first(total_1)
. – Martin GalNA_character_
so all output types match. FYI to make a comment back to folks who commented so they'll see it, use @ and their user name in the comment. :) – aosmith