3
votes

I would like to perform a two-sample t-test on my data in R. Given two high-dimensional data frames, I need to loop through matching columns (matched by the strings returned by colnames()) and run the test over all rows for each column pair, one column from df1 and one from df2. The problem is that the columns of the two data frames are not in the same order, i.e. col1 from df1 doesn't match col1 from df2, and df2 has additional columns that don't exist in df1. I've never used R for such tasks, and I wonder if there is a fast and handy way to find the matching column pairs for the t-test.

I thought about for-loops but I think this would be very inefficient for large data frames.

Thank you in advance for any help.

EDIT: Two small example data frames, df1 and df2

****df1****

"Row\Column"    "A2"    "A1"    "A4"    "A3"
"id_1"           10      20      0       40
"id_2"           5       15      25      35
"id_3"           8       0       12      16
"id_4"           17      25      0       40

****df2****

"Row\Column"    "A3"    "A8"    "A5"    "A6"    "A1"    "A7"    "A4"    "A2"
"id_1"           0       2       0       4       0       1       2       3
"id_2"           1       5       8       3       4       5       6       7
"id_3"           2       10      6       9       8       9       10      11
"id_4"           7       2       10      2       55      0       0       0
"id_5"           0       1       0       0       9       1       3       4
"id_6"           8       0       1       2       7       2       3       0  

Matching columns are simply the columns whose names in df1 also appear among the column names of df2. For example, matching pairs in df1 and df2 are "A1" and "A1", "A2" and "A2", and so on.

2
Hello and welcome to SO. A sample of df1 and df2 would be helpful. You can use dput(head(df1)); dput(head(df2)) and copy+paste the output into your question. - Ricardo Saporta
...also be sure to include a precise explanation of what you mean by "matching columns". Because there's no way for us to just guess which columns should go together. - joran
Hi, I added two sample data frames to my question. I hope that my examples make it clearer for everyone :) - myX.

2 Answers

0
votes

It is very hard to give you a good answer without a reproducible example. You also need to define what you mean by matching columns.

Here is an example of two matrices (indexed by column name in the same way as data frames) that have some column names in common.

df1 <- matrix(sample(1:100, 5*5, replace = TRUE), ncol = 5, nrow = 5)
df2 <- matrix(sample(1:100, 5*8, replace = TRUE), ncol = 8, nrow = 5)
colnames(df1) <- letters[6:10]
colnames(df2) <- rev(letters[1:8])

Then I define a wrapper around t.test, for example to limit the output to the p-value and the degrees of freedom.

f <- function(x,y){
  test <- t.test(x,y)
  data.frame(df   = test$parameter,
             pval = test$p.value)
}

Then, using sapply, I iterate over the common column names, which I get using intersect:

sapply(intersect(colnames(df1), colnames(df2)),
       function(x) f(df1[, x], df2[, x]))

     f         g         h        
df   7.85416   6.800044  7.508915 
pval 0.5792354 0.2225824 0.4392895
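
The same approach can be tried on the sample data from the question. The following is only a sketch: it rebuilds df1 and df2 as data frames by hand from the tables in the question (replacing the random matrices above), and the names common and res are introduced here purely for illustration. vapply is used instead of sapply so that the result is a plain numeric matrix.

```r
# Sample data copied from the tables in the question (row ids as row names)
df1 <- data.frame(
  A2 = c(10, 5, 8, 17),
  A1 = c(20, 15, 0, 25),
  A4 = c(0, 25, 12, 0),
  A3 = c(40, 35, 16, 40),
  row.names = paste0("id_", 1:4)
)
df2 <- data.frame(
  A3 = c(0, 1, 2, 7, 0, 8),
  A8 = c(2, 5, 10, 2, 1, 0),
  A5 = c(0, 8, 6, 10, 0, 1),
  A6 = c(4, 3, 9, 2, 0, 2),
  A1 = c(0, 4, 8, 55, 9, 7),
  A7 = c(1, 5, 9, 0, 1, 2),
  A4 = c(2, 6, 10, 0, 3, 3),
  A2 = c(3, 7, 11, 0, 4, 0),
  row.names = paste0("id_", 1:6)
)

# Column names present in both data frames, in the order they appear in df1
common <- intersect(colnames(df1), colnames(df2))

# One t-test per shared column; keep the degrees of freedom and the p-value
res <- vapply(
  common,
  function(nm) {
    tt <- t.test(df1[[nm]], df2[[nm]])  # Welch two-sample test (R's default)
    c(df = unname(tt$parameter), pval = tt$p.value)
  },
  numeric(2)
)
res
```

Here res should have one column per shared name (A2, A1, A4, A3) and the rows df and pval. The two data frames have different numbers of rows, which is fine for an unpaired two-sample test.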

4
votes

mapply is the function you are looking for.
If the columns of your data.frames already matched up by position, you could simply use

mapply(t.test, df1, df2)

However, since they do not, you somehow need to identify which column of df1 goes with which column of df2. Fortunately, the indexing options in R are clever, and if you feed in a vector (a collection) of column names, you will get back those columns in the order given. This makes life easy.

# find the matching names
## this will give you those names in df1 that are also in df2
## and *only* such names (ie, strict intersect)
matchingNames <- names(df1)[names(df1) %in% names(df2)]

Notice that matchingNames has a definite order (the order in which the names appear in df1). Now look at what happens when you use the matchingNames vector as an index to the columns of each of df1 and df2 (note also the column order):

df1[, matchingNames]
df2[, matchingNames]
matchingNames    

Therefore, we now have two data.frames with properly matched columns, which we can pass to mapply.

## mapply will apply a function to each data.frame, one pair of columns at a time

## The first argument to `mapply` is your function, in this example, `t.test`
## The second and third arguments are the data.frames (or lists) to simultaneously iterate over
mapply(t.test, df1[, matchingNames], df2[, matchingNames])
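
If only a single number per test is wanted (say, the p-value), one possible follow-up is sketched below. The toy data and the names tests and pvals are assumptions made purely for illustration. The sketch uses df1[matchingNames] rather than df1[, matchingNames], so that the result stays a data.frame even when only one name matches, and it sets SIMPLIFY = FALSE so that mapply returns a plain list of test objects.

```r
set.seed(1)  # toy data, for illustration only
df1 <- data.frame(A2 = rnorm(4), A1 = rnorm(4))
df2 <- data.frame(A1 = rnorm(6), A3 = rnorm(6), A2 = rnorm(6))

# Names in df1 that are also in df2, in df1's order
matchingNames <- names(df1)[names(df1) %in% names(df2)]

# SIMPLIFY = FALSE gives a named list with one "htest" object per column pair
tests <- mapply(t.test, df1[matchingNames], df2[matchingNames],
                SIMPLIFY = FALSE)

# Extract the p-value from each test into a named numeric vector
pvals <- vapply(tests, function(tt) tt$p.value, numeric(1))
pvals
```

Each element of tests is an ordinary t.test result, so other components (for example tt$statistic or tt$conf.int) can be pulled out in the same way.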