4
votes

I have gene expression data from 77 cancer patients. I have one set from the patients' blood, one set from the patients' tumors and one set from the patients' healthy tissue:

data1 <- ExpressionBlood
data2 <- ExpressionCancerTissue
data3 <- ExpressionHealtyTissue

I would like to perform an analysis to see whether expression in the tumor tissue correlates with expression in the blood for all my genes. What is the best way to do this?

1
What about calculating the log-fold change? - SmallChess

1 Answer

3
votes

If you are familiar with Python, I'd use pandas. It uses "DataFrames" much as R does, so you could take the concept and apply it in R.

Assuming your data1 is a tab-delimited file formatted like this:

GeneName | ExpValue |
gene1       300.0
gene2       250.0

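Then you can do this to read each data set into a DataFrame (in the question, data2 is the tumor set and data3 is the healthy tissue set):

import pandas as pd

dfblood = pd.read_csv('path/to/data1',delimiter='\t')   # blood
dftissue = pd.read_csv('path/to/data3',delimiter='\t')  # healthy tissue
dftumor = pd.read_csv('path/to/data2',delimiter='\t')   # tumor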

Now merge the DataFrames into one master df.

dftmp = pd.merge(dfblood,dftissue,on='GeneName',how='inner')
df = pd.merge(dftmp,dftumor,on='GeneName',how='inner')

Rename your columns, taking care that the names follow the merge order (blood, tissue, tumor).

df.columns = ['GeneName','blood','tissue','tumor']
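One way to avoid depending on column order is to rename each value column before merging. The following is a minimal sketch, assuming the file layout shown earlier; the in-memory strings and the `read_expr` helper are hypothetical stand-ins for your real files.

```python
import io
import pandas as pd

# Hypothetical tab-delimited contents standing in for the three files.
blood_txt = "GeneName\tExpValue\ngene1\t300.0\ngene2\t250.0\ngene3\t10.0\n"
tissue_txt = "GeneName\tExpValue\ngene1\t120.0\ngene2\t80.0\n"
tumor_txt = "GeneName\tExpValue\ngene2\t95.0\ngene1\t400.0\ngene4\t5.0\n"

def read_expr(text, label):
    # Read one table and give its value column a meaningful name up front.
    table = pd.read_csv(io.StringIO(text), delimiter="\t")
    return table.rename(columns={"ExpValue": label})

blood = read_expr(blood_txt, "blood")
tissue = read_expr(tissue_txt, "tissue")
tumor = read_expr(tumor_txt, "tumor")

# Inner joins keep only genes present in all three tables.
merged = blood.merge(tissue, on="GeneName", how="inner")
merged = merged.merge(tumor, on="GeneName", how="inner")
print(merged)
```

Because the columns are named before the merge, there is no `ExpValue_x`/`ExpValue_y` ambiguity to untangle afterwards.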

Now you can normalize your data (if it's not already) with easy commands.

df = df.set_index('GeneName') # allows you to perform computations on the entire dataset
df_norm = (df - df.mean()) / (df.max() - df.min())
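Note that the Pearson correlation is unchanged by shifting a column or dividing it by a positive constant, so this rescaling matters for other downstream calculations rather than for the correlation itself. A small sketch on made-up data (the random values and the `df_demo` name are illustrative assumptions, not real expression data):

```python
import numpy as np
import pandas as pd

# Synthetic, strictly positive values standing in for expression data.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(
    rng.gamma(2.0, 100.0, size=(50, 3)),
    columns=["blood", "tissue", "tumor"],
)

# Same mean-centering and range scaling as in the answer.
df_demo_norm = (df_demo - df_demo.mean()) / (df_demo.max() - df_demo.min())

# The two correlation matrices agree up to floating-point error.
same = np.allclose(df_demo.corr(), df_demo_norm.corr())
print(same)
```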

You can call df_norm.corr() to produce results like those below. From this point, you can also use numpy to perform more complex calculations, if needed.

          blood      tissue       tumor
blood   1.000000    0.395160    0.581629
tissue  0.395160    1.000000    0.840973
tumor   0.581629    0.840973    1.000000
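The matrix above compares the data sets across genes, with one value per gene. If you instead have one value per gene per patient, a per-gene correlation across patients may be closer to what the question asks. The sketch below assumes a genes x patients layout, which the question does not show; the names `blood_mat` and `tumor_mat` and all values are made up for illustration.

```python
import pandas as pd

patients = ["p1", "p2", "p3", "p4"]
genes = ["geneA", "geneB"]

# Hypothetical genes x patients matrices (rows: genes, columns: patients).
blood_mat = pd.DataFrame(
    [[1.0, 2.0, 4.0, 7.0], [5.0, 3.0, 8.0, 1.0]],
    index=genes, columns=patients,
)
tumor_mat = pd.DataFrame(
    [[3.0, 5.0, 9.0, 15.0], [-5.0, -3.0, -8.0, -1.0]],
    index=genes, columns=patients,
)

# Transpose so that genes become columns, then correlate matching columns.
# The result is one Pearson r per gene, computed across patients.
per_gene_r = blood_mat.T.corrwith(tumor_mat.T)
print(per_gene_r)
```

In this toy data geneA's tumor values are a positive linear function of its blood values and geneB's are the negation, so the two coefficients come out at the extremes of the range.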

Hope this helps you at least move in the right direction.

EDIT

If you want to use Student T's log-fold change, you could calculate the log of the original data using numpy.log:

import numpy as np

df[['blood','tissue','tumor']] = df[['blood','tissue','tumor']]+1
# +1 to avoid taking the log of 0
df_log = np.log(df[['blood','tissue','tumor']])

To get the log-fold change for each gene, subtract the log values, since log(a) - log(b) = log(a/b). This appends new columns to your df_log DataFrame.

df_log['logFCBloodTumor'] = df_log['blood'] - df_log['tumor']
df_log['logFCBloodTissue'] = df_log['blood'] - df_log['tissue']
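As a quick worked check of the subtraction, here is a small sketch on made-up numbers. It swaps in `np.log2` for `np.log` (an assumption on my part, not what the code above uses) so that a value of -1 reads directly as "half as much in blood as in tumor"; the `df_fc` name and its values are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values; the +1 pseudocount turns them into 300/600 and 1/4.
df_fc = pd.DataFrame(
    {"blood": [299.0, 0.0], "tumor": [599.0, 3.0]},
    index=["gene1", "gene2"],
)

# +1 avoids taking the log of 0, as in the answer.
log2_expr = np.log2(df_fc + 1)

# Difference of logs equals the log of the ratio (blood / tumor).
log2_expr["log2FCBloodTumor"] = log2_expr["blood"] - log2_expr["tumor"]
print(log2_expr)
```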