8 votes

I'm attempting to run some statistical analyses on a field trial that was conducted at 2 sites during the same growing season.

At both sites (Site, levels: HF|NW) the experimental design was an RCBD (randomised complete block design) with 4 blocks (n = 4; Block, levels: 1|2|3|4 within each Site). There were 4 treatments - 3 different forms of nitrogen fertiliser and a control with no nitrogen fertiliser (Treatment, levels: AN, U, IU, C). During the field trial there were 3 distinct periods, each commencing with fertiliser addition and ending with harvesting of the grass. These periods have been given the levels 1|2|3 under the factor N_app.

There is a range of measurements on which I would like to test the following null hypothesis (H0):

H0: Treatment had no effect on the measurement.

Two of the measurements I am particularly interested in are: grass yield and ammonia emissions.

Starting with grass yield (Dry_tonnes_ha), which is a nice balanced data set.

The data can be downloaded in R using the following code:

library(tidyverse)

# download the yield data and read it in
download.file('https://www.dropbox.com/s/w5ramntwdgpn0e3/HF_NW_grass_yield_data.csv?raw=1', destfile = "HF_NW_grass_yield_data.csv", method = "auto")
raw_data <- read.csv("HF_NW_grass_yield_data.csv", stringsAsFactors = FALSE)

# convert the design variables to factors, parse the date and order the Treatment levels
HF_NW_grass <- raw_data %>% mutate_at(vars(Site, N_app, Block, Plot, Treatment), as.factor) %>% 
  mutate(Date = as.Date(Date, format = "%d/%m/%Y"),
         Treatment = factor(Treatment, levels = c("AN", "U", "IU", "C")))

I have had a go at running an ANOVA on this using the following approach:

model_1 <- aov(formula = Dry_tonnes_ha ~ Treatment * N_app + Site/Block, data = HF_NW_grass, projections = TRUE)

I have a few concerns with this.

Firstly, what is the best way to test assumptions? For a simple one-way ANOVA I would use shapiro.test() and bartlett.test() on the dependent variable (Dry_tonnes_ha) to assess normality and homogeneity of variance. Can I use the same approach here?
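For reference, this is the kind of check I mean, applied here to the raw dependent variable (I gather the normality check can equally be applied to the residuals of model_1 above):

shapiro.test(HF_NW_grass$Dry_tonnes_ha)   # normality of the dependent variable
shapiro.test(residuals(model_1))          # or of the model residuals

bartlett.test(Dry_tonnes_ha ~ Treatment, data = HF_NW_grass)   # homogeneity of variance across Treatments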

Secondly, I am concerned that N_app is a repeated measure, as the same measurement is taken from the same plot over 3 different periods - what is the best way to build these repeated measures into the model?

Thirdly, I'm not sure of the best way to nest Block within Site. At both sites the levels of Block are 1:4. Do I need to have unique Block levels for each site?
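If unique labels are needed, I assume something like this would do it (creating a new factor whose levels combine Site and Block):

# block labels made unique across sites, e.g. "HF.1" ... "NW.4"
HF_NW_grass <- HF_NW_grass %>%
  mutate(Block_unique = interaction(Site, Block, drop = TRUE))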

I have another data set for NH3 emissions here. R code to download:

download.file('https://www.dropbox.com/s/0ax16x95m2z3fb5/HF_NW_NH3_emissions.csv?raw=1', destfile = "HF_NW_NH3_emissions.csv", method = "auto")
raw_data_1 <- read.csv("HF_NW_NH3_emissions.csv", stringsAsFactors = FALSE)

HF_NW_NH3 <- raw_data_1 %>% mutate_at(vars(Site, N_app, Block, Plot, Treatment), as.factor) %>% 
  mutate(Treatment = factor(Treatment, levels = c("AN", "U", "IU", "C")))

For this I have all of the concerns above, with the addition that the data set is unbalanced. At HF, n = 3 for N_app 1 but n = 4 for N_app 2 and 3; at NW, n = 4 for all N_app levels. At HF, measurements were only made on the Treatment levels U and IU, while at NW measurements were made on Treatment levels AN, U and IU.

I'm not sure how to deal with this added level of complexity. I am tempted to just analyse the two sites separately (the fact that the N_app periods are not the same at each site may encourage this approach). Can I use a Type III sums of squares ANOVA here?
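If that is allowed, I assume a per-site Type III analysis would look something like the sketch below (using the car package; NH3_emission is a placeholder for whatever the response column is actually called, and I understand sum-to-zero contrasts are needed for Type III tests to be meaningful):

library(car)

# Type III sums of squares require sum-to-zero contrasts
options(contrasts = c("contr.sum", "contr.poly"))

# fit one site at a time; NH3_emission is a placeholder column name
model_HF <- lm(NH3_emission ~ Treatment * N_app + Block, data = subset(HF_NW_NH3, Site == "HF"))
Anova(model_HF, type = "III")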

It has been suggested to me that a linear mixed modelling approach may be the way forward, but I'm not familiar with using these models.

I would welcome your thoughts on any of the above. Thanks for your time.

Rory


2 Answers

4 votes

To answer your first question on the best way of testing assumptions: while your approach of using formal statistical tests implemented in R is reasonable, I would actually just visualize the distributions and see whether the data meet the ANOVA assumptions. This approach may seem somewhat subjective, but it does work in most cases.

  • independently, identically distributed (i.i.d.) data: this is a question you may already be able to answer, based on how much you know about your data. It is possible to use a chi-square test to determine independence (or not).
  • normally distributed data: use a histogram / QQ plot to check. Based on the distribution, I think it is reasonable to use aov despite the slightly bimodal distribution.

(It appears that a log transformation helps to further meet the normality assumption. This is something you may consider, especially for downstream analyses.)

par(mfrow=c(2,2))

# density and QQ plot of the raw dependent variable
plot(density(HF_NW_grass$Dry_tonnes_ha), col="red", main="Density")
qqnorm(HF_NW_grass$Dry_tonnes_ha, col="red", main="qqplot")
qqline(HF_NW_grass$Dry_tonnes_ha)

# the same plots for the log10-transformed variable
DTH_trans <- log10(HF_NW_grass$Dry_tonnes_ha)
plot(density(DTH_trans), col="blue", main="transformed density")
qqnorm(DTH_trans, col="blue", main="transformed qqplot")
qqline(DTH_trans)

Regarding your second question, on the best way to build repeated measures into the model: unfortunately it is difficult to pinpoint a single "best" model, but based on my knowledge (mostly through genomics big data), you may want to use a linear mixed-effects model. This can be implemented through the lme4 R package, for example. Since it appears you already know how to construct a linear model in R, you should have no problem applying lme4's functions.
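As a rough sketch of what that could look like for your grass yield data (not a definitive specification - the random-effects structure is something you will need to think about, and I am assuming Plot is the unit measured repeatedly):

library(lme4)

# Fixed effects: Treatment, N_app, their interaction, and Site (with only two sites,
# Site is easier to treat as fixed). Random intercepts: blocks nested within sites, and
# plots (the unit measured repeatedly over the N_app periods). Site:Block:Plot is used
# so the plot grouping is unique even if plot labels are re-used across sites.
mm1 <- lmer(Dry_tonnes_ha ~ Treatment * N_app + Site +
              (1 | Site:Block) + (1 | Site:Block:Plot),
            data = HF_NW_grass)
summary(mm1)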

Your third question, regarding whether to nest the two variables, is tricky. If I were you, I would start with Site and Block as if they were independent factors. However, if you know they are not independent, you should probably nest them.
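In lme4 syntax the two choices would look roughly like this (illustrative only):

# Site and Block treated as independent: Block 1 is assumed to be the "same" block at both sites
mm_indep  <- lmer(Dry_tonnes_ha ~ Treatment * N_app + Site + (1 | Block),
                  data = HF_NW_grass)

# Block nested within Site: Site:Block gives each block a label unique to its site,
# so Block 1 at HF and Block 1 at NW are treated as different blocks
mm_nested <- lmer(Dry_tonnes_ha ~ Treatment * N_app + Site + (1 | Site:Block),
                  data = HF_NW_grass)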

I think your questions and concerns are quite open-ended. My recommendation is that as long as you have a plausible justification for your choices, go ahead.

1 vote

I agree with @David C on the use of visual diagnostics. Simple QQ plots should work:

# dt is assumed to be a data.table copy of the yield data with lower-case column names
library(data.table)
dt <- setnames(as.data.table(HF_NW_grass), tolower(names(HF_NW_grass)))

# dependent variable on the raw and log scales
par(mfrow=c(1,2))
qqnorm(dt[,dry_tonnes_ha]); qqline(dt[,dry_tonnes_ha], probs= c(0.15, 0.85))
qqnorm(log(dt[,dry_tonnes_ha])); qqline(log(dt[,dry_tonnes_ha]), probs= c(0.15, 0.85))

[Image: QQ plots of dry_tonnes_ha on the raw and log scales]

The log transformation looks reasonable to me. You can also see this from the density plot of the raw variable, which is long-tailed and somewhat bimodal:

par(mfrow=c(1,1))
plot(density(dt[,dry_tonnes_ha]))

You could alternatively use lineup plots (Buja et al., 2009) if you wish; I'm not sure they are needed in this case. The nullabor package provides a vignette:

library(nullabor)
library(ggplot2)   # for qplot() and facet_wrap(); already attached above via tidyverse

# the real data is hidden among decoy panels in which dry_tonnes_ha has been permuted
# (treatment may not be the best variable to plot against - I'm not familiar with your data)
dt_l <- lineup(null_permute("dry_tonnes_ha"), dt)
qplot(dry_tonnes_ha, treatment, data = dt_l) + facet_wrap(~ .sample)

[Image: lineup plot of dry_tonnes_ha by treatment]

For the other assumptions, you can just use the standard diagnostic plots from the lm fit:

# residuals vs fitted, normal QQ, scale-location and residuals vs leverage plots
lm2 <- lm(log(dry_tonnes_ha) ~ treatment * n_app + site/block, data = dt)
plot(lm2)

I don't see anything too troublesome in these plots.