1
votes

Does anyone know if it is possible to use a variable in one dataframe (in my case the "deploy" dataframe) to create a variable in another dataframe?

For example, I have two dataframes:

df1:

   deploy <- data.frame(ID = c("20180101_HH1_1_1", "20180101_HH1_1_2", "20180101_HH1_1_3"), 
             Site_Depth = c(42, 93, 40), Num_Depth_Bins_Required = c(5, 100, 4), 
             Percent_Column_in_each_bin = c(20, 10, 25))

df2:

   sp.c <- data.frame(species = c("RR", "GS", "GT", "BR", "RS", "BA", "GS", "RS", "SH", "RR"), 
                      ct = c(25, 66, 1, 12, 30, 6, 1, 22, 500, 6), 
                      percent_dist_from_surf = c(11, 15, 33, 68, 71, 100, 2, 65, 5, 42))

I want to create new columns in df2 that assigns each species and count to a bin based on the Percent_Column_in_each_bin for each ID. For example, in 20180101_HH1_1_3 there would be 4 bins that each make up 25% of the column and all species that are within 0-25% of the column (in df2) would be in bin 1 and species within 25-50% of the column would be in depth bin 2, and so on. What I'm imagining this looking like is:

    i.want.this <- data.frame(species = c("RR", "GS", "GT", "BR", "RS", "BA", "GS", "RS", "SH", "RR"), 
                      ct = c(25, 66, 1, 12, 30, 6, 1, 22, 500, 6), 
                      percent_dist_from_surf = c(11, 15, 33, 68, 71, 100, 2, 65, 5, 42),
                      '20180101_HH1_1_1_Bin' = c(1, 1, 2, 4, 4, 5, 1, 4, 1, 3),
                      '20180101_HH1_1_2_Bin' = c(2, 2, 4, 7, 8, 10, 1, 7, 1, 5), 
                      '20180101_HH1_1_3_Bin' = c(1, 1,  2, 3, 3, 4, 1, 3, 1, 2))

I am pretty new to R and I'm not sure how to make this happen. I need to do this for over 100 IDs (all with different depths, number of depth bins, and percent of the column in each bin) so I was hoping that I don't need to do them all by hand. I have tried mutate in dplyr but I can't get it to pull from two different dataframes. I have also tried ifelse statements, but I would need to run the ifelse statement for each ID individually.

I don't know if what I am trying to do is possible but I appreciate the feedback. Thank you in advance!

Edit: my end goal is to find the max count (max ct) for each species within each bin for each ID. What I've been doing to find this (using the bins generated with suggestions from @Ben) is using dplyr to slice and find the max ID like this:

    20180101_HH1_1_1 <- sp.c %>%
                        group_by(20180101_HH1_1_1, species) %>%
                        arrange(desc(ct)) %>% 
                        slice(1) %>%
                        group_by(20180101_HH1_1_1) %>%
                        mutate(Count_Total_Per_Bin = sum(ct)) %>%
                        group_by(species, add=TRUE) %>% 
                        mutate(species_percent_of_total_in_bin = 
                             paste0((100*ct/Count_Total_Per_Bin) %>%
                        mutate(ID= "20180101_HH1_1_1 ") %>%
                        ungroup()

but I have to do this for over 100 IDs. My desired output would be something like:

    end.goal <- data.frame(ID = c(rep("20180101_HH1_1_1", 8)),
                   species = c("RR", "GS", "SH", "GT", "RR", "BR", "RS", "BA"),
                   bin = c(1, 1, 1, 2, 3, 4, 4, 5),
                   Max_count_of_each_species_in_each_bin = c(11, 66, 500, 1, 6, 12, 30, 6),
                   percent_dist_from_surf = c(11, 15, 5, 33, 42, 68, 71, 100),
                   percent_each_species_max_in_each_bin = c((11/577)*100, (66/577)*100, (500/577)*100, 100, 100, (12/42)*100, (30/42)*100, 100))

I was thinking that by answering the original question I could get to this but I see now that there's still a lot you have to do to get this for each ID.

2

2 Answers

0
votes

It is helpful to distinguish between the contents of your two dataframes.

  • df2 appears to contain measurements from some sites
  • df1 appears to contain parameters by which you want to process/summarise the measurements in df2

Given these different purposes of the two dataframes, your best approach is probably to loop over all the rows of df1 each time adding a column to df2. Something like the following:

max_dist = max(df2$percent_dist_from_surf)

for(ii in 1:nrow(df1)){

  # extract parameters
  this_ID = df1[[ii,"ID"]]
  this_depth = df1[[ii,"Site_Depth"]]
  this_bins = df1[[ii,"Num_Depth_Bins_Required"]]
  this_percent = df1[[ii,"Percent_Column_in_each_bin"]]

  # add column to df2
  df2 = df2 %>%
    mutate(!!sym(this_ID) := insert_your_calculation_here)
}

The !!sym(this_ID) := part of the code is to allow dynamic naming of your output columns.

And as best I can determine the formula you want for insert_your_calculation_here is ceil(percent_dist_from_surf / max_dist * this_bins)

0
votes

Here is another approach, which does not require a loop.

Using sapply you can cut to determine bins for each percent_dist_from_surf value in your deploy dataframe.

res <- sapply(deploy$Percent_Column_in_each_bin, function(x) { 
  cut(sp.c$percent_dist_from_surf, seq(0, 100, by = x), include.lowest = TRUE, labels = 1:(100/x))
})
colnames(res) <- deploy$ID
cbind(sp.c, res)

Or using purrr:

library(purrr)

cbind(sp.c, imap(setNames(deploy$Percent_Column_in_each_bin, deploy$ID), 
     ~ cut(sp.c$percent_dist_from_surf, seq(0, 100, by = .x), include.lowest = TRUE, labels = 1:(100/.x))
))

Output

   species  ct percent_dist_from_surf 20180101_HH1_1_1 20180101_HH1_1_2 20180101_HH1_1_3
1       RR  25                     11                1                2                1
2       GS  66                     15                1                2                1
3       GT   1                     33                2                4                2
4       BR  12                     68                4                7                3
5       RS  30                     71                4                8                3
6       BA   6                    100                5               10                4
7       GS   1                      2                1                1                1
8       RS  22                     65                4                7                3
9       SH 500                      5                1                1                1
10      RR   6                     42                3                5                2

Edit:

To determine the maximum ct value for each species, site, and bin, put the result of above into a dataframe called res and do the following.

First would put into long form with pivot_longer. Then you can group_by species, site, and bin, and determine the maximum ct for this combination.

library(tidyverse)

res %>%
  pivot_longer(cols = starts_with("2018"), names_to = "site", values_to = "bin") %>%
  group_by(species, site, bin) %>%
  summarise(max_ct = max(ct)) %>%
  arrange(site, bin)

Output

# A tibble: 26 x 4
# Groups:   species, site [21]
   species site             bin   max_ct
   <fct>   <chr>            <fct>  <dbl>
 1 GS      20180101_HH1_1_1 1         66
 2 RR      20180101_HH1_1_1 1         25
 3 SH      20180101_HH1_1_1 1        500
 4 GT      20180101_HH1_1_1 2          1
 5 RR      20180101_HH1_1_1 3          6
 6 BR      20180101_HH1_1_1 4         12
 7 RS      20180101_HH1_1_1 4         30
 8 BA      20180101_HH1_1_1 5          6
 9 GS      20180101_HH1_1_2 1          1
10 SH      20180101_HH1_1_2 1        500
11 GS      20180101_HH1_1_2 2         66
12 RR      20180101_HH1_1_2 2         25
13 GT      20180101_HH1_1_2 4          1
14 RR      20180101_HH1_1_2 5          6
15 BR      20180101_HH1_1_2 7         12
16 RS      20180101_HH1_1_2 7         22
17 RS      20180101_HH1_1_2 8         30
18 BA      20180101_HH1_1_2 10         6
19 GS      20180101_HH1_1_3 1         66
20 RR      20180101_HH1_1_3 1         25
21 SH      20180101_HH1_1_3 1        500
22 GT      20180101_HH1_1_3 2          1
23 RR      20180101_HH1_1_3 2          6
24 BR      20180101_HH1_1_3 3         12
25 RS      20180101_HH1_1_3 3         30
26 BA      20180101_HH1_1_3 4          6