Comparing multiple data frames based on unique values in one column and finding overlapping values in second column in multiple data frames in R

Question

I would like advice on identifying intersecting values across multiple data frames. The problem seems a bit complex to me, and I can't figure out how to do it with the normal intersect function.

I have several data frames (up to 12) with multiple columns. They show how genes change over time (for example, over 5 time points) and which other genes correlate with that change (i.e., other genes that also go down or up in a manner that correlates with other genes in the data). The analysis takes each gene one at a time as a reference and tests every other gene against it to see whether its pattern of change over time correlates with that of the reference gene. This is repeated for every single gene. Taking one data frame as an example, the results would appear as follows.

Column 1 contains the reference gene. This value can occur multiple times if several genes correlate with the changes over time in this gene. For example, if genes b, c and d correlate with gene a, the first two columns look as follows:

a b
a c
a d

The same goes for gene b, and so on and so forth 20,000 times (the number of genes)! I hope this makes sense.

b a
b c
b d

The analysis above is carried out on multiple different samples, so I will get up to 12 data frames, one per sample, each with results as detailed above.

Objective (apologies in advance that I do not have code, as I am not entirely sure where to start; I am thinking this might best be served by writing a function): for gene 'x' in column 1, I would like to see whether column 2 has values that overlap across every single data frame.

Taking the example above, multiple data frames may look like this:

df1
a b
a c
a d
df2
a d
a c
a e
df3
a d
a e
a f

So, comparing the data frames, the function would identify that for gene a there is one column 2 value shared between all data frames: gene d, as it is common to all data frames for gene a.

Similarly, the function would carry out this overlap analysis for every single gene: gene a, b, c, etc.

The output would be, for every single gene in column 1, the column 2 values that occur for that gene across all the data frames.

I am pasting head(analysis):

            Feature1           Feature2 delay      pBefore       pAfter  corBefore
1 ENSMUSG00000001525 ENSMUSG00000026211     0 0.1093914984 0.1093914984  0.7161907
2 ENSMUSG00000001525 ENSMUSG00000055653    -1 0.0916478944 0.1047749696  0.7414240
3 ENSMUSG00000001525 ENSMUSG00000003038     0 0.0006810160 0.0006810160  0.9786161

plus many, many more genes in Feature1, each with its associated genes in Feature2.

This data frame would be one sample, and I would have a separate result for each of the other samples.

I would really appreciate any hints on how to write code to achieve this goal. In addition, it would be nice to be able to specify that I would like to see the overlap only for genes that meet a condition, e.g. pBefore >= 0.8, or the same for the delay column, etc.

Many thanks for taking the time to read this!

No_body · Accepted Answer · 2019-01-26T21:53:48

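If I understand correctly, you can stack all 12 data frames into one. Note that this uses pandas (Python) rather than R:

    import pandas as pd

    # "..." is a placeholder: list every data frame from df1 to df12 here
    df_final = pd.concat([df1, df2, ..., df12])

Then count how many times each pair of genes occurs. Here 'A' and 'B' stand for the two gene columns (Feature1 and Feature2 in your data):

    df_n = df_final.groupby(['A', 'B']).size().reset_index(name='count')

As there are 12 data frames, keeping the rows with a count of 12

    df_n[df_n['count'] == 12]

will give you the pairs of genes present in all 12 data frames, provided each pair appears at most once per data frame.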
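
As a minimal sketch of the same idea on the toy data from the question, assuming the gene columns are named Feature1 and Feature2 as in head(analysis). The function name `shared_partners` and its `keep` argument are hypothetical, introduced only for illustration. Duplicate pairs are dropped per data frame before counting, so a pair repeated within one sample cannot inflate the count.

```python
import pandas as pd


def shared_partners(frames, ref_col="Feature1", partner_col="Feature2", keep=None):
    """Return the (reference, partner) pairs found in every data frame in `frames`.

    `keep` is an optional function that takes a data frame and returns a
    boolean mask selecting the rows to use (hypothetical helper argument).
    """
    cleaned = []
    for frame in frames:
        if keep is not None:
            frame = frame[keep(frame)]  # optional row filter, e.g. on pBefore
        # count each pair at most once per data frame
        cleaned.append(frame[[ref_col, partner_col]].drop_duplicates())

    stacked = pd.concat(cleaned, ignore_index=True)
    counts = stacked.groupby([ref_col, partner_col]).size().reset_index(name="count")

    # a pair seen in every data frame has a count equal to the number of frames
    in_all = counts[counts["count"] == len(frames)]
    return in_all[[ref_col, partner_col]].reset_index(drop=True)


# toy data from the question
df1 = pd.DataFrame({"Feature1": ["a", "a", "a"], "Feature2": ["b", "c", "d"]})
df2 = pd.DataFrame({"Feature1": ["a", "a", "a"], "Feature2": ["d", "c", "e"]})
df3 = pd.DataFrame({"Feature1": ["a", "a", "a"], "Feature2": ["d", "e", "f"]})

common = shared_partners([df1, df2, df3])
print(common)  # a single row: Feature1 = a, Feature2 = d
```

For the extra filtering you mention, something like `shared_partners(frames, keep=lambda f: f["pBefore"] >= 0.8)` should restrict each data frame to the matching rows before the overlap is computed; the same pattern would apply to the delay column.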
