2
votes

I'm struggling to figure out how to do some ANOVA testing for a collection of responses in multiple environments & sample types. I have a big dataframe, and I want to run these ANOVAs separately by these grouping factors, but they shouldn't be included in the formula.

Maybe I'm missing an obvious way to do this, but so far I've just been subsetting the data is tedious to do..

In SAS, this would be similar to a BY statement within the PROC statement.. it is quite easy to do in SAS.

Here is an example:

data(iris)
data = iris

# suppose the iris data was in two environments..
data$location = unlist(rep(list("US", "USSR"), length(data$Sepal.Length)/2))
data$location = as.factor(data$location)
# and suppose that there was another subgrouping..
data$subgroup = unlist(list(rep(c(rep("A", 25), rep("B", 25)), 3)))
data$subgroup = as.factor(data$subgroup)
# and suppose I only want to look at the differences between two species
somedata = subset(data, Species == "setosa" | Species == "versicolor")
somedata = droplevels(somedata)
# suppose that I want to test if sepal length and sepal width are different..
# between species BY location AND BY subgroup
# and I'm only interested in the pvalues for these comparisons
# in my real data, I have more than just two responses I want to test..

# I could subset all the data.. which is what I've been doing..
# by location
dataUS = subset(somedata, location == "US")
dataUSSR = subset(somedata, location == "USSR")

# then by species
dataUS_A = subset(dataUS, subgroup=="A")
dataUS_B = subset(dataUS, subgroup=="B")

dataUSSR_A = subset(dataUSSR, subgroup=="A")
dataUSSR_B = subset(dataUSSR, subgroup=="B")

t.test(Sepal.Width ~ Species, data=dataUS_A)

Can someone offer a faster way to get me the p values for these comparisons? Possibly with multiple testing control, such as tukey?

I've also tried making the data in long format, and using the subset option, but that is almost as tedious.

I've looked at anova, aov, and a couple other approaches but I've gotten stuck with each. I've also tried making my data wide format and doing something like this:

summary( aov(as.matrix(cbind(somedata[,c(1:2)])) ~ Species*location*subgroup, data=somedata) )

but I still can't figure out how to separate by a group & subgroup

I've also tried concatenating columns into one big "group" column and using that new column as my grouping but that doesn't work either:

somedata$group = do.call(paste, c(somedata[c("location","subgroup")], sep = "_"))
3

3 Answers

3
votes

Okay, first off, your code is unnecessarily complicated. Check out several simplifications below that will hopefully make things simpler.

Then, to your core question, you want a split-apply-combine strategy. You need to split the data by the relevant grouping variables, then do a t.test in each subset of the data. You can achieve this with a combination of split and lapply. (Per your comment, to obtain results for multiple outcomes, you need nested lapply functions).

# the data, again
data <- iris
data$location <- factor(rep(c("US", "USSR"), length.out = length(data$Sepal.Length)))
data$subgroup <- factor(rep(c(rep("A", 25), rep("B", 25)), 3))
somedata <- data[data$Species %in% c("setosa","versicolor"),]

DVs <- c('Sepal.Width','Sepal.Length','Petal.Length')
out <- lapply(DVs, function(x){
    lapply(split(somedata, list(somedata$location, somedata$subgroup)),
    function(z) {
        t.test(update(~ Species,paste(x,'~.')), data=z)$p.value
     })
})

Here's the result:

> do.call(cbind, setNames(out,DVs))
       Sepal.Width  Sepal.Length Petal.Length
US.A   9.183405e-05 9.371858e-06 5.852323e-14
USSR.A 0.0001211233 0.0001385488 2.97461e-12 
US.B   8.473525e-06 0.0001902751 4.123647e-11
USSR.B 0.001818272  5.308597e-06 7.593105e-11
2
votes

Simplifying your code a bit:

data(iris)
df = iris
df$location = factor(rep(c("US", "USSR"), nrow(df)/2))
df$subgroup = factor(rep(c("A", "B"), each = 25, 3))
df = subset(df, Species != "virginica")

Defining the function to use in by

grouptest = function(dat){
    t.test(Sepal.Width ~ Species, data = dat)
}

Running by location and subgroup:

by(df, df[,c("location", "subgroup")], FUN = grouptest)
0
votes

Have you looked at the by function in R? If that is not enough then you might consider the plyr package (for split, operate, combine).