4
votes

I have a dataframe containing different groups, years and their values, for example:

data <- data.frame(
  group = c(rep('A', 120), rep('B', 120)),
  year  = rep(c(rep('2013-2014', 40), rep('2014-2015', 40), rep('2015-2016', 40)), 2),
  value = rnorm(240)
)

For each year within each group I want to run a t-test to see whether the values are significantly different from those of the previous year. (I have been using t.test(x, y, var.equal = TRUE) to do this as a one-off.)

I would like to return a dataframe with the p-values, or preferably significance stars generated using gtools::stars.pval(), so something like the following:

group year      significance
A     2013-2014 NA
A     2014-2015 **
A     2015-2016 ***
B     2013-2014 NA
B     2014-2015
B     2015-2016

In the above, the p-value for the difference between 2014-2015 and 2013-2014 for 'A' is between 0.001 and 0.01, and the p-value for the difference between 2015-2016 and 2014-2015 for 'A' is < 0.001. There is no evidence of a significant difference between any consecutive years for 'B'.

There is no guarantee that each of the groups has the same number of years.

What is the best and quickest way of doing this? I was hoping I could do it with dplyr, using group_by on group and year.

2 Answers

9
votes

Another option is to summarise the data frame, storing all the values in one cell as a list (yes, you can do that - data frames can have nested lists inside!)
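
To make the list-column idea concrete before the dplyr version, here is a minimal base-R sketch. The `toy` data frame and its numbers are made up purely for illustration; they are not part of the question's data.

```r
# Toy data (hypothetical, for illustration only)
toy <- data.frame(id = c("a", "b"))

# Assign a list with one element per row: each cell now holds a whole vector
toy$values <- list(c(1.2, 3.4, 5.6), c(7.8, 9.0))

n_rows    <- nrow(toy)              # the data frame still has one row per id
cell_size <- lengths(toy$values)    # number of values stored in each cell
first_avg <- mean(toy$values[[1]])  # pull one cell back out with [[ ]]
```

Each cell of `values` is an ordinary numeric vector, which is why it can be handed straight to a function such as t.test() once it has been extracted.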

Using dplyr:

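library(dplyr)

# One row per group/year, with that year's values stored in a list column
df <- data %>%
  arrange(group, year) %>%
  group_by(group, year) %>%
  summarise(values = list(value), .groups = "drop_last")  # still grouped by group
# Previous year's values within each group; the first year has no predecessor
df <- df %>% mutate(prev_values = lag(values)) %>% filter(row_number() > 1)
df <- df %>%
  group_by(group, year) %>%
  mutate(p_value = t.test(unlist(values), unlist(prev_values), var.equal = TRUE)$p.value) %>%
  print()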

  group      year    values prev_values   p_value
1     A 2014-2015 <dbl[40]>   <dbl[40]> 0.7894477
2     A 2015-2016 <dbl[40]>   <dbl[40]> 0.2385581
3     B 2014-2015 <dbl[40]>   <dbl[40]> 0.3084138
4     B 2015-2016 <dbl[40]>   <dbl[40]> 0.2557849
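
The table above stops at p-values and drops the first year of each group, whereas the question asked for stars and an NA row for the first year. The following is a rough base-R sketch of that last step, not the method used above: `p_to_stars()` and `year_on_year()` are hypothetical helpers, and the cutpoints (0.001, 0.01, 0.05, 0.1) are assumed to match the ones gtools::stars.pval() documents. If gtools is installed, calling gtools::stars.pval() directly on the non-missing p-values should serve the same purpose.

```r
# Hypothetical helper: map p-values to significance stars, keeping NA as NA.
# Cutpoints are an assumption modelled on the gtools::stars.pval() documentation.
p_to_stars <- function(p) {
  out <- rep(NA_character_, length(p))
  ok <- !is.na(p)
  out[ok] <- as.character(cut(
    p[ok],
    breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
    labels = c("***", "**", "*", ".", ""),
    include.lowest = TRUE
  ))
  out
}

# Hypothetical helper: within each group, t-test every year against the
# previous one (years are ordered by sorting their labels).
year_on_year <- function(data) {
  pieces <- lapply(split(data, data$group), function(g) {
    by_year <- split(g$value, g$year)         # one numeric vector per year
    by_year <- by_year[lengths(by_year) > 0]  # drop unused factor levels, if any
    yrs <- names(by_year)
    p <- rep(NA_real_, length(yrs))           # first year stays NA
    for (i in seq_along(yrs)[-1]) {
      p[i] <- t.test(by_year[[i]], by_year[[i - 1]], var.equal = TRUE)$p.value
    }
    data.frame(group = as.character(g$group[1]), year = yrs, p_value = p,
               significance = p_to_stars(p), stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Same shape as the data in the question
data <- data.frame(
  group = c(rep('A', 120), rep('B', 120)),
  year  = rep(c(rep('2013-2014', 40), rep('2014-2015', 40), rep('2015-2016', 40)), 2),
  value = rnorm(240)
)
result <- year_on_year(data)
print(result)
```

Because each group is handled separately, groups with different numbers of years need no special treatment. A group with a single year simply yields one row with NA.
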
2
votes

I really liked @MaksimGayduk's solution, especially the "trick" with summarise(values=list(value)). I haven't used that before and it seems very useful. My alternative, similar solution is based on the dplyr and broom packages.

The differences are that (a) I first create a table with the appropriate info for the t-tests of interest and then look up the corresponding values in the initial df data frame, and (b) the broom package returns all the info from the t.test output as a data frame, from which you can pick p.value or anything else you need.

set.seed(15)

df <- data.frame(
  group = c(rep('A', 120), rep('B', 120)),
  year  = rep(c(rep('2013-2014', 40), rep('2014-2015', 40), rep('2015-2016', 40)), 2),
  value = rnorm(240)
)


library(dplyr)
library(broom)

df %>% 
  select(group, year) %>%
  arrange(group,year) %>%
  distinct() %>%
  group_by(group) %>%
  mutate(lag_year = lag(year)) %>%
  filter(!is.na(lag_year)) %>%
  group_by(group, year, lag_year) %>%
  do(tidy(t.test(df$value[df$year==.$year & df$group==.$group], 
                 df$value[df$year==.$lag_year & df$group==.$group])))


# Source: local data frame [4 x 11]
# Groups: group, year, lag_year [4]
# 
# group      year  lag_year    estimate   estimate1   estimate2  statistic   p.value parameter   conf.low conf.high
# (fctr)    (fctr)    (fctr)       (dbl)       (dbl)       (dbl)      (dbl)     (dbl)     (dbl)      (dbl)     (dbl)
# 1      A 2014-2015 2013-2014 -0.14570115  0.04597952  0.19168066 -0.6752803 0.5016009  74.05084 -0.5756153 0.2842130
# 2      A 2015-2016 2014-2015 -0.02752882  0.01845069  0.04597952 -0.1162621 0.9077438  77.96192 -0.4989302 0.4438726
# 3      B 2014-2015 2013-2014  0.39565472  0.05703318 -0.33862155  1.5776920 0.1187303  77.10933 -0.1037022 0.8950116
# 4      B 2015-2016 2014-2015 -0.07423089 -0.01719771  0.05703318 -0.3048113 0.7613240  77.77704 -0.5590850 0.4106233
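
Note that the fractional `parameter` values above (e.g. 74.05) are Welch degrees of freedom, because t.test() defaults to var.equal = FALSE. Pass var.equal = TRUE if you want the pooled-variance test from the question. Below is a small base-R sketch of the difference. `flatten_ttest()` is a hypothetical helper that collects by hand a few of the fields shown in the table, and `x` and `y` are made-up stand-ins for two years of values.

```r
set.seed(15)
x <- rnorm(40)  # hypothetical stand-in for one year's values
y <- rnorm(40)  # hypothetical stand-in for the previous year's values

welch  <- t.test(x, y)                    # default: var.equal = FALSE
pooled <- t.test(x, y, var.equal = TRUE)  # pooled variance, as in the question

# Hypothetical helper: flatten an htest object into a one-row data frame
flatten_ttest <- function(tt) {
  data.frame(
    statistic = unname(tt$statistic),
    p.value   = tt$p.value,
    parameter = unname(tt$parameter),  # degrees of freedom
    conf.low  = tt$conf.int[1],
    conf.high = tt$conf.int[2]
  )
}

comparison <- rbind(welch = flatten_ttest(welch), pooled = flatten_ttest(pooled))
print(comparison)
```

With 40 values in each sample the pooled test has 40 + 40 - 2 = 78 degrees of freedom, while the Welch test reports a value no larger than that.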