Subsetting by data in two columns

Question

I have a dataset of the following structure

     site    block treatment date insect1 insect2 insect3 insect4 ...
1  location1     a  chemical1 date1  0     0      10       1          
2  location1     a  chemical2 date1  1     0       2       0   
3  location1     a  chemical3 date1  0     0      23       1   
4  location1     a  chemical4 date1  0     0       5       0   
5  location1     a  chemical5 date1  0     0       9       0   
6  location1     b  chemical1 date1  0     1       5       0   
7  location1     b  chemical2 date1  1     0       5       1   
8  location1     b  chemical3 date1  0     0       4       0   
9  location1     b  chemical4 date1  0     0       5       0   
10 location1     b  chemical5 date1  3     0      12       0   
11 location1     c  chemical1 date1  0     0       2       1   
12 location1     c  chemical2 date1  0     0       0       0   
13 location1     c  chemical3 date1  0     0       4       0   
14 location1     c  chemical4 date1  0     0       2       7   
15 location1     c  chemical5 date1  2     0       5       0   
16 location1     d  chemical1 date1  0     0       8       1   
17 location1     d  chemical2 date1  0     0       3       0   
18 location1     d  chemical3 date1  0     0      10       0   
19 location1     d  chemical4 date1  0     0       2       0   
20 location1     d  chemical5 date1  0     1       7       0
       .         .     .        .    .     .       .       .   
       .         .     .        .    .     .       .       .   
       .         .     .        .    .     .       .       .

This dataset is the result of an experiment that I conducted testing the effect of different five different chemical treatments (chemicals1-5) on the attraction of a number of different species of insects (insects1-4 here) at a field site (location1). This experiment was blocked (a-d) 4 times at different locations within that field site, and replicated 5 times at different dates (only date1 shown). All of this information is stored within my first four columns of my dataset.

The next series of columns (I have 46, but I'm only showing 4) indicate a different species of insect, and the number of insects that I captured with a specific chemical Within each treatment x block x date combination (=each row).

As part of my analysis, I want to run through this dataset and find the combinations of block x date for each insect where I caught no insects. For instance, I captured no individuals of insect2 in block a or c for date1, and thus I would like to drop that from my final dataset for analysis.

I have spent a lot of time working through code to accomplish this task, but I discovered last night that my code was not working as I had thought and I am at my wits end trying to figure it out. Here is the code so far (I have included all the steps to get to my problem, so folks can see where the issue may have been introduced, or to suggest a better way of doing things...):

Create a list such that each species of insect (columns 5-8 here) has its own dataframe

sticky.list = lapply(sticky[-c(1:4,50)], function(i)data.frame(site=sticky$site, 
                                                          block=sticky$block,
                                                          treatment=sticky$treatment,
                                                          date=sticky$date,
                                                          number=as.numeric(i)))

Example of part of one of the dataframes that was created as part of my list

$insect1
       site    block     treatment date     number
1  location1     a       chemical1 date1      0
2  location1     a       chemical2 date1      1
3  location1     a       chemical3 date1      0
4  location1     a       chemical4 date1      0
5  location1     a       chemical5 date1      0

Then Add a new column within each dataframe in the list that has the dataframe name (i.e., insect name)

temp.list = Map(cbind, sticky.list, morphotype = names(sticky.list))  

       site    block   treatment date     number morphotype
1  location1     a     chemical1 date1      0      insect1
2  location1     a     chemical2 date1      1      insect1      
3  location1     a     chemical3 date1      0      insect1
4  location1     a     chemical4 date1      0      insect1
5  location1     a     chemical5 date1      0      insect1

Make a larger dataset by combining vertically, then flatten each list element (i.e., make one big dataframe. This puts all of the dataframes from my previous list together in one dataframe.

sticky.list.combined.df <- temp.list %>% bind_rows(temp.list) %>% # make larger sample data
  mutate_if(is.list, simplify_all) %>% # flatten each list element internally 
  unnest()

Group by block and morphotype and find sum of number based on this grouping. Then, add this sum column to the main large dataframe that we just created, i.e. sticky.list.combined.df, using an inner join.

sticky.list.combined.df.sum<- sticky.list.combined.df %>%
  group_by(date, block, morphotype) %>%
  summarize(sum = sum(number))

# A tibble: 855 x 4
# Groups:   date, block [?]
   date            block morphotype    sum
   <fct>           <fct> <chr>       <dbl>
 1 date1 a     insect1     0
 2 date1 a     insect2     0
 3 date1 a     insect3     0
 4 date1 a     insect4     0
# … with 845 more rows

then

sticky.list.analysis<-left_join(sticky.list.combined.df,sticky.list.combined.df.sum, by=c("date"="date",
                                                                                          "morphotype"="morphotype"))

This is an example of the output showing only insect1. The deciding factors whether to keep the 5 rows for each block.x are the last two columns, block.y and sum, which indicate the sum of all of the insects captured for chemicals1-5 for each block (a-d).

      site       block.x    treatment date    number     morphotype block.y sum
1   location1       a       chemical1 date1      0         insect1       a   2
2   location1       a       chemical1 date1      0         insect1       b   8
3   location1       a       chemical1 date1      0         insect1       c   4
4   location1       a       chemical1 date1      0         insect1       d   0
5   location1       a       chemical2 date1      0         insect1       a   2
6   location1       a       chemical2 date1      0         insect1       b   8
7   location1       a       chemical2 date1      0         insect1       c   4
8   location1       a       chemical2 date1      0         insect1       d   0
9   location1       a       chemical3 date1      0         insect1       a   2
10  location1       a       chemical3 date1      0         insect1       b   8
11  location1       a       chemical3 date1      0         insect1       c   4
12  location1       a       chemical3 date1      0         insect1       d   0
13  location1       a       chemical4 date1      0         insect1       a   2
14  location1       a       chemical4 date1      0         insect1       b   8
15  location1       a       chemical4 date1      0         insect1       c   4
16  location1       a       chemical4 date1      0         insect1       d   0
17  location1       a       chemical5 date1      0         insect1       a   2
18  location1       a       chemical5 date1      0         insect1       b   8
19  location1       a       chemical5 date1      0         insect1       c   4
20  location1       a       chemical5 date1      0         insect1       d   0

This is where I believe the issue I'm having arises

Filter rows with sum > 0.

For each combination of trapping date (e.g., date1) and morphotype, remove the rows (i.e., blocks a-d) that have zero captures of a morphotype in that block. It is typical in trapping experiments (and common in Hanks lab statistical practices) to drop or not include dates that have no captures of our target insect. This could be related to abiotic factors (e.g., too cold/hot, rain) or phenological factors associated with the insect. Leaving these zeroes in our data reduces our chances of finding significant effects in our data, thus we will exclude them.

sticky.list.analysis.reduced<- sticky.list.analysis %>% 
  filter(sum > 0)

The shortened output below shows that for insect1 we should be keeping blocks a-c. Which blocks are kept will vary based upon which insect were are looking at. What I want to do is now take this data from block.y and use it to drop rows for those blocks.

Unfortunately, this is not my desired output. R has dropped one row in based upon the sum column. We now see that block d was dropped according to the block.y column. Unfortunately we needed to drop rows 46-60.

Output:

       site block.x treatment date number morphotype block.y sum
1    location1   a    chemical1 date1   0      insect1    a   2
2    location1   a    chemical1 date1   0      insect1    b   8
3    location1   a    chemical1 date1   0      insect1    c   4
4    location1   a    chemical2 date1   0      insect1    a   2
5    location1   a    chemical2 date1   0      insect1    b   8
6    location1   a    chemical2 date1   0      insect1    c   4
7    location1   a    chemical3 date1   0      insect1    a   2
8    location1   a    chemical3 date1   0      insect1    b   8
9    location1   a    chemical3 date1   0      insect1    c   4
10   location1   a    chemical4 date1   0      insect1    a   2
11   location1   a    chemical4 date1   0      insect1    b   8
12   location1   a    chemical4 date1   0      insect1    c   4
13   location1   a    chemical5 date1   0      insect1    a   2
14   location1   a    chemical5 date1   0      insect1    b   8
15   location1   a    chemical5 date1   0      insect1    c   4
16   location1   b    chemical1 date1   0      insect1    a   2
17   location1   b    chemical1 date1   0      insect1    b   8
18   location1   b    chemical1 date1   0      insect1    c   4
19   location1   b    chemical2 date1   0      insect1    a   2
20   location1   b    chemical2 date1   0      insect1    b   8
21   location1   b    chemical2 date1   0      insect1    c   4
22   location1   b    chemical3 date1   0      insect1    a   2
23   location1   b    chemical3 date1   0      insect1    b   8
24   location1   b    chemical3 date1   0      insect1    c   4
25   location1   b    chemical4 date1   0      insect1    a   2
26   location1   b    chemical4 date1   0      insect1    b   8
27   location1   b    chemical4 date1   0      insect1    c   4
28   location1   b    chemical5 date1   0      insect1    a   2
29   location1   b    chemical5 date1   0      insect1    b   8
30   location1   b    chemical5 date1   0      insect1    c   4
31   location1   c    chemical1 date1   0      insect1    a   2
32   location1   c    chemical1 date1   0      insect1    b   8
33   location1   c    chemical1 date1   0      insect1    c   4
34   location1   c    chemical2 date1   0      insect1    a   2
35   location1   c    chemical2 date1   0      insect1    b   8
36   location1   c    chemical2 date1   0      insect1    c   4
37   location1   c    chemical3 date1   0      insect1    a   2
38   location1   c    chemical3 date1   0      insect1    b   8
39   location1   c    chemical3 date1   0      insect1    c   4
40   location1   c    chemical4 date1   0      insect1    a   2
41   location1   c    chemical4 date1   0      insect1    b   8
42   location1   c    chemical4 date1   0      insect1    c   4
43   location1   c    chemical5 date1   0      insect1    a   2
44   location1   c    chemical5 date1   0      insect1    b   8
45   location1   c    chemical5 date1   0      insect1    c   4
46   location1   d    chemical1 date1   0      insect1    a   2
47   location1   d    chemical1 date1   0      insect1    b   8
48   location1   d    chemical1 date1   0      insect1    c   4
49   location1   d    chemical2 date1   0      insect1    a   2
50   location1   d    chemical2 date1   0      insect1    b   8
51   location1   d    chemical2 date1   0      insect1    c   4
52   location1   d    chemical3 date1   0      insect1    a   2
53   location1   d    chemical3 date1   0      insect1    b   8
54   location1   d    chemical3 date1   0      insect1    c   4
55   location1   d    chemical4 date1   0      insect1    a   2
56   location1   d    chemical4 date1   0      insect1    b   8
57   location1   d    chemical4 date1   0      insect1    c   4
58   location1   d    chemical5 date1   0      insect1    a   2
59   location1   d    chemical5 date1   0      insect1    b   8
60   location1   d    chemical5 date1   0      insect1    c   4

Desired output:

       site block.x treatment date number morphotype block.y sum
1    location1   a    chemical1 date1   0      insect1    a   2
2    location1   a    chemical1 date1   0      insect1    b   8
3    location1   a    chemical1 date1   0      insect1    c   4
4    location1   a    chemical2 date1   0      insect1    a   2
5    location1   a    chemical2 date1   0      insect1    b   8
6    location1   a    chemical2 date1   0      insect1    c   4
7    location1   a    chemical3 date1   0      insect1    a   2
8    location1   a    chemical3 date1   0      insect1    b   8
9    location1   a    chemical3 date1   0      insect1    c   4
10   location1   a    chemical4 date1   0      insect1    a   2
11   location1   a    chemical4 date1   0      insect1    b   8
12   location1   a    chemical4 date1   0      insect1    c   4
13   location1   a    chemical5 date1   0      insect1    a   2
14   location1   a    chemical5 date1   0      insect1    b   8
15   location1   a    chemical5 date1   0      insect1    c   4
16   location1   b    chemical1 date1   0      insect1    a   2
17   location1   b    chemical1 date1   0      insect1    b   8
18   location1   b    chemical1 date1   0      insect1    c   4
19   location1   b    chemical2 date1   0      insect1    a   2
20   location1   b    chemical2 date1   0      insect1    b   8
21   location1   b    chemical2 date1   0      insect1    c   4
22   location1   b    chemical3 date1   0      insect1    a   2
23   location1   b    chemical3 date1   0      insect1    b   8
24   location1   b    chemical3 date1   0      insect1    c   4
25   location1   b    chemical4 date1   0      insect1    a   2
26   location1   b    chemical4 date1   0      insect1    b   8
27   location1   b    chemical4 date1   0      insect1    c   4
28   location1   b    chemical5 date1   0      insect1    a   2
29   location1   b    chemical5 date1   0      insect1    b   8
30   location1   b    chemical5 date1   0      insect1    c   4
31   location1   c    chemical1 date1   0      insect1    a   2
32   location1   c    chemical1 date1   0      insect1    b   8
33   location1   c    chemical1 date1   0      insect1    c   4
34   location1   c    chemical2 date1   0      insect1    a   2
35   location1   c    chemical2 date1   0      insect1    b   8
36   location1   c    chemical2 date1   0      insect1    c   4
37   location1   c    chemical3 date1   0      insect1    a   2
38   location1   c    chemical3 date1   0      insect1    b   8
39   location1   c    chemical3 date1   0      insect1    c   4
40   location1   c    chemical4 date1   0      insect1    a   2
41   location1   c    chemical4 date1   0      insect1    b   8
42   location1   c    chemical4 date1   0      insect1    c   4
43   location1   c    chemical5 date1   0      insect1    a   2
44   location1   c    chemical5 date1   0      insect1    b   8
45   location1   c    chemical5 date1   0      insect1    c   4

Once this problem is resolved, I'd like to subset each insect from its column (I know how to do this manually, but not for all insect species, but that is a whole different question) then run generalized linear mixed models to assess the effect of treatment on the capture of each insect, with date and location as random effects.

I appreciate any insight into this matter. If I need to edit this to add any additional information, please let me know, I've done my best to make the structure of my data and problem clear. Thank you.

Iago Carvalho Iago Carvalho · Accepted Answer · 2019-02-20T17:00:09

have you tried the subset function? It is defined under the base R package (link).

You can do the following:

filtered.sticky.list.analysis <- subset(sticky.list.analysis, block.x == "a" || block.x == "b" || block.x == "c")

Another thing that can work is:

filtered.sticky.list.analysis <- subset(sticky.list.analysis, block.x != "d")

The codes are pretty clear. The first option selects everything where block.x is equal to a, b, or c. The second option selects everything that is different from d.

Subsetting by data in two columns

1 Answers