How to create a data frame of various test results?

Question

In our experiment, we have a data frame with the following columns:

Participant	Condition	parametricVariables	nonParametricVariables	orderNumber
1	Condition 1	14.7	4	1
1	Condition 2	11.4	1	2
2	Condition 1	8.2	7	2
2	Condition 2	13.0	6	1
...	...	...	...	...

We have multiple parametric and multiple non-parametric variables, but only two conditions. The orderNumber column records the order in which a participant tested each condition: participant 1 tested Condition 1 first and then Condition 2, while participant 2 tested them in the opposite order.

We are trying to see whether there is, despite our best efforts, an unsystematic variation based on the order of the conditions. So far we have just been calling the test functions and reading the results from the output, like this:

ParticipantOrder1 <- gameSummary %>% filter(orderNumber == 1)
Condition1Order1 <- ParticipantOrder1 %>% filter(Condition==condition1_label)
Condition2Order1 <- ParticipantOrder1 %>% filter(Condition==condition2_label)

ParticipantOrder2 <- gameSummary %>% filter(orderNumber == 2)
Condition1Order2 <- ParticipantOrder2 %>% filter(Condition==condition1_label)
Condition2Order2 <- ParticipantOrder2 %>% filter(Condition==condition2_label)

# Check parametric variables for normality
# ...

# Check for difference in the parametric variable across the two orders using Welch's t-test
t.test(ParticipantOrder1$parametric, ParticipantOrder2$parametric)
t.test(Condition1Order1$parametric, Condition1Order2$parametric)
t.test(Condition2Order1$parametric, Condition2Order2$parametric)

# Check for difference in the non-parametric variable across the two orders using Wilcoxon signed rank test
wilcox.test(ParticipantOrder1$nonParametric, ParticipantOrder2$nonParametric, paired=TRUE,exact=FALSE)
wilcox.test(Condition1Order1$nonParametric, Condition1Order2$nonParametric, paired=TRUE,exact=FALSE)
wilcox.test(Condition2Order1$nonParametric, Condition2Order2$nonParametric, paired=TRUE,exact=FALSE)

As you can see, this approach gets rather unwieldy when one has multiple parametric and non-parametric variables. I am wondering whether there is a nicer way to collect all these test results into a table like this:

Variable	Condition	TestType	statistic	p-value
parametric1	Both	Welch Two Sample t-test	0.10317	0.9185
parametric1	Condition 1	Welch Two Sample t-test	0.625	0.5462
parametric1	Condition 2	Welch Two Sample t-test	-0.69369	0.503
nonParametric1	Both	Wilcoxon signed rank test with continuity correction	18	0.6295
...	...	...	...	...

sjaustirni · Accepted Answer · 2021-07-10T13:53:11

Group the data

First, we should group all data by a groupingVariable.

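library(dplyr)
library(tidyr)

analysisSummary <- gameSummary %>%
  select(parametric1, parametric2, nonparametric1, groupingVariable) %>%
  # Long format: one row per (variable, value) pair
  gather(key = variable, value = value, -groupingVariable) %>%
  group_by(variable, groupingVariable) %>%
  # Collapse each group's values into a single list element
  summarise(value = list(value)) %>%
  # One list-column per value of groupingVariable
  spread(groupingVariable, value) %>%
  group_by(variable)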

If you want to see how this query was built, I recommend looking at this tutorial by Sebastian Sauer.

This gives us the following table, with one list-column per groupingValue (that is, per value of groupingVariable):

variable	groupingValue1	groupingValue2
parametric1	<dbl [X]>	<dbl [Y]>
parametric2	<dbl [X]>	<dbl [Y]>
nonparametric1	<dbl [X]>	<dbl [Y]>

parametric1, parametric2 and nonparametric1 are the variables you want to compare between the two groups.
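To try the grouping step without the real data, here is a minimal sketch on made-up data. The toy tibble, its values and the group labels "a" and "b" are assumptions for illustration only; the pipeline itself is the same gather/summarise/spread chain as above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two groups of five observations each
set.seed(42)
toy <- tibble(
  groupingVariable = rep(c("a", "b"), each = 5),
  parametric1 = rnorm(10),
  nonparametric1 = rep(1:5, times = 2)
)

grouped <- toy %>%
  gather(key = variable, value = value, -groupingVariable) %>%
  group_by(variable, groupingVariable) %>%
  summarise(value = list(value), .groups = "drop") %>%
  spread(groupingVariable, value)

# Each cell of the list-columns holds all values for one variable and one group
lengths(grouped$a)
grouped$b[[1]]
```

Calling unlist() on one of these cells (or indexing it with [[ ]]) yields the plain numeric vector that the test functions expect.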

groupingVariable is the metric you divide the populations by. For instance, it could be sex, in which case the groupingValues would probably be male and female [1]. Or, going by the example from the question, groupingVariable could be orderNumber and groupingValues would be 1 and 2. Note that these have numerical values - and that brings us to a problem.

Numerical `groupingVariable`s

After spread(), the new columns are named after the values of groupingVariable, here 1 and 2. In a dplyr column selection, however, a bare number is interpreted as a column position, not as a column name. If you want readable code, you can rename these columns to order1 and order2 using

analysisSummary <- analysisSummary %>% rename(order1 = 2, order2 = 3)

This assumes that the groupingValue1 and groupingValue2 columns are in the 2nd and 3rd position in the table, respectively.
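As an alternative sketch (not part of the original approach), tidyr's newer pivot_longer() and pivot_wider() can replace gather() and spread(), and the names_prefix argument sidesteps the numeric column names entirely. This assumes a reasonably recent tidyr (1.1.0 or later, as far as I know, for passing a single function to values_fn); the toy data below is made up.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for gameSummary
set.seed(1)
gameSummary <- tibble(
  Participant = rep(1:6, each = 2),
  orderNumber = rep(c(1, 2), times = 6),
  parametric1 = rnorm(12, mean = 10, sd = 2),
  nonparametric1 = sample(1:7, 12, replace = TRUE)
)

analysisSummary <- gameSummary %>%
  select(parametric1, nonparametric1, orderNumber) %>%
  pivot_longer(-orderNumber, names_to = "variable", values_to = "value") %>%
  pivot_wider(
    names_from = orderNumber,
    values_from = value,
    values_fn = list,        # collect each group's values into a list element
    names_prefix = "order"   # columns become order1 and order2
  )

names(analysisSummary)
```

With this variant no positional rename() is needed, so the code does not break if the column order changes.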

Run the tests

We can use case_when to conditionally run different tests for different variables.

analysisSummary %>% mutate(
    # Save the name of the test for convenient reference later
    test = case_when(
        isVariableParametric(variable) ~ "Welch's t test",
        TRUE ~ "Wilcoxon test"
    ),
    # Run Welch's t-test for parametric variables and the Wilcoxon rank sum test for non-parametric ones, save the p-value
    p_value = case_when(
        isVariableParametric(variable) ~ t.test(unlist(groupingValue1), unlist(groupingValue2))$p.value,
        TRUE ~ wilcox.test(unlist(groupingValue1), unlist(groupingValue2), paired=FALSE, exact=FALSE)$p.value
    ),
    # Run the test again, but now save the test statistic
    statistic = case_when(
        isVariableParametric(variable) ~ t.test(unlist(groupingValue1), unlist(groupingValue2))$statistic,
        TRUE ~ wilcox.test(unlist(groupingValue1), unlist(groupingValue2), paired=FALSE, exact=FALSE)$statistic
    )
)

You should also define a function that decides whether the variable is parametric or not. In my case I hardcoded it (but a long-term, reusable solution would be to resolve it dynamically):

isVariableParametric <- function(variable) {
  variable %in% c('parametric1', 'parametric2')
}

This will give us a table with easily browsable results:

variable	groupingValue1	groupingValue2	test	p_value	statistic
parametric1	<dbl [X]>	<dbl [Y]>	Welch's t test	0.19081	0.23504
parametric2	<dbl [X]>	<dbl [Y]>	Welch's t test	0.16398	0.00014
nonparametric1	<dbl [X]>	<dbl [Y]>	Wilcoxon test	0.78727	87.5000
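The case_when approach runs every test twice, once for the p-value and once for the statistic. A possible variation, sketched here in base R under my own assumptions, runs each test once and returns the fields of interest as a one-row data frame. The helper name runTest, the toy vectors and the tiny summaryTable are all hypothetical.

```r
# Hypothetical helper: run one test and keep the method name, statistic and p-value
runTest <- function(x, y, parametric) {
  result <- if (parametric) {
    t.test(x, y)  # Welch's t-test is the default (var.equal = FALSE)
  } else {
    wilcox.test(x, y, paired = FALSE, exact = FALSE)
  }
  data.frame(
    test = result$method,
    statistic = unname(result$statistic),
    p_value = result$p.value
  )
}

# Made-up data in the same shape as the grouped table (list-columns g1 and g2)
set.seed(1)
x <- rnorm(10, mean = 10)
y <- rnorm(10, mean = 11)
summaryTable <- data.frame(variable = c("parametric1", "nonparametric1"))
summaryTable$g1 <- list(x, round(x))
summaryTable$g2 <- list(y, round(y))

# Apply the helper row by row and bind the one-row results together
results <- do.call(rbind, Map(
  runTest,
  summaryTable$g1,
  summaryTable$g2,
  summaryTable$variable %in% c("parametric1", "parametric2")
))
cbind(variable = summaryTable$variable, results)
```

Taking the test name from result$method also keeps the label in sync with the test that actually ran.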

[1] Sticking to two groups for simplicity here, as testing for differences between more than two groups requires additional checks statistics-wise (Bonferroni correction) or a different approach (ANOVA).
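If a Bonferroni correction is needed, base R's p.adjust() can apply it to a vector of collected p-values. The sketch below reuses the three p-values from the example results table purely as illustration; whether a correction is appropriate for your design is a statistical judgment this snippet does not make.

```r
# p-values copied from the example results table (illustrative only)
p_values <- c(0.19081, 0.16398, 0.78727)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
adjusted <- p.adjust(p_values, method = "bonferroni")
adjusted
```

Other methods such as "holm" are available through the same method argument.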
