0
votes

I'm having trouble figuring out how code for the divisions of numerical values: (col1)/(col2), based on the True/False values in columns 3 & 4

I have 500 rows of data and I'm trying to calculate the mean yield of a crop (kg crop/hectares) based on different conditions. I'm trying to answer a question like "what would the mean yield be if the condition in column 3 was True and column 4 was False?"

EDIT: here is example data.

col 1   col2   col 3   col4
1.5     2.0     T       T
1.5     2.0     F       T
2.5     5.0     F       F
2.5     5.0     F       T

so I'm trying to find the mean of col1/col2 if, for example, col3 = F and col4 = T

thank you!

3
It's easier to help you if you include a simple reproducible example with sample input and desired output that can be used to test and verify possible solutions. For your example, just include a few rows rather than all 500.MrFlick
As a general note: If one of the answers has helped you with your problem, you should accept it as correct. This helps the community know that questions have been answered.Dylan_Gomes

3 Answers

0
votes

You need to subset your data based on the two conditions. You can do that using [col3 & !col4], like so:

mean(with(data,col1[col3 & !col4]/col2[col3 & !col4]))

(with is just an easier way to not have to keep writing data$ every time).

For example, here's some fake data:

data<-data.frame(col1=1:5,col2=10:6,col3=c(TRUE,TRUE,TRUE,FALSE,FALSE),col4=c(FALSE,TRUE,FALSE,FALSE,TRUE))

and here's what you get from my solution:

mean(with(data,col1[col3 & !col4]/col2[col3 & !col4]))
[1] 0.2375
0
votes

Here is a solution using the tidyverse.

First I create a reproducible data set using the function data.frame.

n <- 100
df <- data.frame(
  kg = runif(n),
  ha = runif(n),
  col3 = sample(c(TRUE,FALSE), n, TRUE),
  col4 = sample(c(TRUE,FALSE), n, TRUE)
)

We can use mutate to calculate yield from our kg and ha variables, group the data by the two conditions, and finally summarize the mean yield for each combination of conditions.

df %>% 
  mutate(kg_ha = kg/ha) %>% 
  group_by(col3, col4) %>% 
  summarize(mean_yield = mean(kg_ha))
0
votes

First, I will create a reproducible dataset:

data<-data.frame("col1"=rnorm(500),"col2"=rnorm(500),
                 "col3"=c(rep(TRUE,250),rep(FALSE,250)),"col4"=c(rep(FALSE,250),rep(TRUE,250)))

If I understand you correctly, I am guessing your "(col1)/(col2)" corresponds to "(kg crop/hectares)" here.

If this is true, you can create a new column in your dataset (named 'data' here) for 'yield' by:

data$yield <- data$col1 / data$col2


head(data)
        col1         col2 col3  col4       yield
1  0.8976488  0.006764518 TRUE FALSE 132.6996029
2 -0.2829754  0.980092790 TRUE FALSE  -0.2887230
3 -0.2266733  1.285616004 TRUE FALSE  -0.1763149
4  1.4690071 -0.297252879 TRUE FALSE  -4.9419440
5 -0.1438242  0.917662116 TRUE FALSE  -0.1567289
6 -1.3297183 -0.880964698 TRUE FALSE   1.5093889

Then there are multiple ways to look at these means. One 'indexing' way would be:

mean(data$yield[data$col3==T & data$col4==F])
[1] 1.929354

This is asking what the yield is when col3 of data is True and col4 is False specifically.

However, if you want the summary of all possible combinations of groups, you can use the package dplyr this way:

install.packages("dplyr") # This will have to be run only the first time you use the package on one machine
library(dplyr) # This code will need to be run every new R session

data %>% group_by(col3,col4) %>% 
  summarise(
    MeanYield = mean(yield)
  )

# A tibble: 2 x 3
# Groups:   col3 [2]
  col3  col4  MeanYield
  <lgl> <lgl>     <dbl>
1 FALSE TRUE      20.4 
2 TRUE  FALSE      1.93

In this case there were only two possibilities (col3=T & col4=F) and (col3=F & col4=T), but the code will give you all possibilities.

  • As an afternote: I know that negative values may not make sense for crop or hectares, I simply used rnorm to be quick here (although my explaining this defeats the purpose of being quick).