
So picture these two groups of female and male ages:

 femalesage<-c(30,52,59,25,26,72,46,32,64,45)
 malesage<-c(40,56,31,63,63,78,42,45,67)

I can easily run t.test(femalesage, malesage) to get the following result:

 t.test(femalesage,malesage)

Welch Two Sample t-test

 data:  femalesage and malesage
 t = -1.2013, df = 16.99, p-value = 0.2461
 alternative hypothesis: true difference in means is not equal to 0
 95 percent confidence interval:
 -24.224797   6.647019
 sample estimates:
 mean of x mean of y 
 45.10000  53.88889 

Now, suppose I have this same data organized differently, so something like this:

ages<-c(30,52,59,25,26,72,46,32,64,45,40,56,31,63,63,78,42,45,67)
genders<-c("F","F","F","F","F","F","F","F","F","F","M","M","M","M","M","M","M","M","M")
df<-data.frame(ages, genders)

I want to achieve a similar result to the Welch two-sample t-test using some sort of regression, testing the slope Beta1 = 0 vs. Beta1 not equal to 0, where Beta1 is the coefficient of genders and the response is ages. Any idea how I could get the same result?

Just so we are clear: you don't want to assume equal variances? (Dason)

1 Answer


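The t-test and linear regression are both special cases of the general linear model. With a single two-level predictor, testing the significance of the regression coefficient is equivalent to the equal-variance (pooled) two-sample t-test. The Welch test that t.test runs by default drops the equal-variance assumption, so it agrees with the regression only approximately.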

R's t.test function allows the input data to be specified in two different ways: either as two separate vectors, as you have done, or with a formula interface, as I do here. The lm function, which fits linear models, also uses the formula interface. This makes the two function calls nearly identical: we need only change the name of the function.

Your data:

ages <- c(30,52,59,25,26,72,46,32,64,45,40,56,31,63,63,78,42,45,67)
genders <- c("F","F","F","F","F","F","F","F","F","F","M","M","M","M","M","M","M","M","M")
df <- data.frame(ages, genders)

A t-test:

t.test(ages ~ genders, data = df)

    Welch Two Sample t-test

data:  ages by genders
t = -1.2013, df = 16.99, p-value = 0.2461
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -24.224797   6.647019
sample estimates:
mean in group F mean in group M 
       45.10000        53.88889 

A (nearly) identical regression:

summary(lm(ages ~ genders, data = df))

Call:
lm(formula = ages ~ genders, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-22.89 -13.49   0.90  11.11  26.90 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   45.100      5.060   8.914 8.12e-08 ***
gendersM       8.789      7.351   1.196    0.248    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16 on 17 degrees of freedom
Multiple R-squared:  0.07756,   Adjusted R-squared:  0.0233 
F-statistic: 1.429 on 1 and 17 DF,  p-value: 0.2483
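
The regression above can be matched exactly, rather than nearly, by asking t.test for the pooled-variance test through its var.equal argument. A minimal sketch, assuming the genders vector has 19 entries (10 "F", 9 "M") to line up with ages; the object names pooled, fit and slope are only illustrative:

```r
ages <- c(30, 52, 59, 25, 26, 72, 46, 32, 64, 45,
          40, 56, 31, 63, 63, 78, 42, 45, 67)
genders <- rep(c("F", "M"), times = c(10, 9))
df <- data.frame(ages, genders)

# Pooled-variance (Student) t-test instead of the default Welch test
pooled <- t.test(ages ~ genders, data = df, var.equal = TRUE)

# Row of the coefficient table that belongs to the gender slope
fit <- summary(lm(ages ~ genders, data = df))
slope <- fit$coefficients["gendersM", ]

# Compare t statistics and p-values side by side
c(t_test = unname(pooled$statistic), lm = unname(slope["t value"]))
c(t_test = pooled$p.value, lm = unname(slope["Pr(>|t|)"]))
```

The two t values differ only in sign, because t.test reports F minus M while the gendersM coefficient is M minus F.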

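Note that the t value for gendersM (1.196) is nearly the same in magnitude as the Welch t (-1.2013), as are the p-values (0.248 vs. 0.2461), and that the gendersM estimate (8.789) is exactly the difference between the two group means. The sign flips because t.test reports F minus M while the coefficient is M minus F. The small discrepancy arises because lm assumes equal variances in both groups, whereas t.test defaults to the Welch test.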
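
If the goal is the Welch-style result from a regression, one option is generalized least squares with a separate residual variance for each gender. The following is a hedged sketch using gls and varIdent from the nlme package, which is assumed to be installed (it ships as a recommended package with most R installations); fit_gls and welch are illustrative names:

```r
library(nlme)

ages <- c(30, 52, 59, 25, 26, 72, 46, 32, 64, 45,
          40, 56, 31, 63, 63, 78, 42, 45, 67)
genders <- factor(rep(c("F", "M"), times = c(10, 9)))
df <- data.frame(ages, genders)

# Same mean model as lm, but each gender gets its own residual variance
fit_gls <- gls(ages ~ genders, data = df,
               weights = varIdent(form = ~ 1 | genders))

# Default Welch test, for comparison
welch <- t.test(ages ~ genders, data = df)

summary(fit_gls)$tTable["gendersM", ]  # slope, SE, t and p from the regression
welch$statistic                        # Welch t (opposite sign)
```

Under this model the slope's standard error should coincide with the Welch standard error, so the t statistics agree up to sign. The p-value from gls is based on the 17 residual degrees of freedom rather than Welch's approximate 16.99, so it may differ in the later decimals.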