1
votes

Please could somebody suggest how I would go about making a principle component analysis with the gene data set I have.

I have a table containing 15 columns. The first one is disease group, where 0 is control, 1 is Ulcerative Colitis and 2 stands for Crohn’s.

The remaining 14 columns correspond to 14 different genes.

I would like to plot a PC1 vs PC2 following PCA (via prcomp), to show whether any clustering or separation between the three groups occurs based on gene expression data ( with each axis showing the proportion of variance). However, I am struggling to know where to start, as I cannot convert my column 1 to row names via row.names=1 as R doesn’t allow repeating row names.

Converting to a matrix and trying to use the below code, does not work.

mockdata1 <- as.matrix(mockdata)
rownames(mockdata1) <- mockdata1[,1]
mockdata1[,1] <- NULL
or 
mockdata2 <-mockdata1 [ ,-1]

With the previous examples that I have done, I have been able to compute the PCA and plot the PCA1 vs PCA2 and colour the data accordingly, following row.names=1, but not sure how to overcome this first initial problem, as I can't use this here.

I have included my data below via dput(head(mockdata))

structure(list(Disease = c(1L, 1L, 0L, 0L, 2L, 2L), Gene1 = c(9104.774619, 
35924.12358, 6.780294688, 1284.690716, 69.50341155, 3935.107345
), Gene2 = c(5224.114486, 35625.73119, 18.35291351, 511.9272679, 
186.7270146, 47611.65544), Gene3 = c(1472.348466, 137571.5525, 
20.78531289, 3019.140256, 146.9615338, 108935.1303), Gene4 = c(2487.124686, 
147604.774, 3.574347972, 1371.576262, 210.6773417, 82831.97458
), Gene5 = c(1872.328747, 235675.6461, 9.834667594, 583.1631957, 
120.6931223, 75874.49936), Gene6 = c(1675.724728, 35931.1852, 
9.91026361, 1634.038443, 58.04818134, 23502.78972), Gene7 = c(3775.885073, 
169672.9921, 5.41305941, 929.2125312, 97.72621248, 46023.7009
), Gene8 = c(5015.202216, 137455.0032, 2.995124554, 1113.882634, 
83.17636201, 14048.19237), Gene9 = c(883.5716868, 45920.44167, 
6.399646876, 892.313155, 117.1104906, 10825.47974), Gene10 = c(1607.790858, 
146627.0588, 1.967559425, 1237.299298, 90.8941744, 32747.04713
), Gene11 = c(2345.478241, 91047.57303, 12.33867961, 663.576224, 
384.5839119, 6692.728154), Gene12 = c(2772.362496, 15511.96753, 
15.64843017, 4143.085461, 169.545757, 22484.03574), Gene13 = c(4131.51741, 
48601.7059, 21.66175797, 2250.0628, 316.0677196, 16612.6508), 
Gene14 = c(1252.440598, 54794.36695, 2.925615978, 708.0342528, 
211.822519, 14021.28425)), row.names = c(NA, 6L), class = "data.frame")
1
The problem you are running into is that each disease has two of each gene with different values. This presents a problem because you have no indication of which of the two identical genes per disease to pair up between the diseases. You would likely eliminate this problem if each Gene and value had another factor level, such as biological replicate or source.rpolicastro

1 Answers

0
votes

I read about the statistics behind PCA a long time ago. It basically generates df*t(df). If you transpose this you could get 14 PCs across the genes. Or you could do this across the 6 samples. Not sure which one you are interested in, comparing genes, or diseases?

For now, just change the Disease column.

df1 <- df[,-1]

prcomp(df1, scale=TRUE)
Standard deviations (1, .., p=6):
[1] 3.535100e+00 1.197773e+00 2.341240e-01 1.164908e-01 4.575297e-03 1.341079e-16

Rotation (n x k) = (14 x 6):
             PC1         PC2         PC3          PC4         PC5         PC6
Gene1  0.2599087 -0.28372577  0.85593436  0.105415154  0.22237406 -0.03275210
Gene2  0.2217516  0.51374227  0.19750686  0.587702793 -0.28955565 -0.05771744
Gene3  0.2658620  0.28239514 -0.18315564  0.178995551  0.20110401  0.55717953
Gene4  0.2793220  0.12737931 -0.15464656  0.169556323  0.07195134 -0.18818707
Gene5  0.2818335 -0.06260368 -0.17524380  0.066508759  0.19418520 -0.23609300
Gene6  0.2753095  0.19146854 -0.05941163 -0.009035774  0.27267048 -0.02872877
Gene7  0.2804052 -0.10775145 -0.11519116  0.035882068  0.20239393 -0.18183889
Gene8  0.2697234 -0.25139746 -0.04378742 -0.067463492  0.21022545  0.41231175
Gene9  0.2787618 -0.13908918 -0.13407274 -0.103917770 -0.05348686  0.14863994
Gene10 0.2782801 -0.14660837 -0.15870274 -0.039789645  0.20346369 -0.24515625
Gene11 0.2672274 -0.27313184 -0.09102046 -0.087085239 -0.41532869  0.18704109
Gene12 0.2089616  0.55676419  0.20707121 -0.730763260  0.02663728 -0.08109933
Gene13 0.2819621 -0.06202963  0.11256511 -0.133063115 -0.54206507  0.22546812
Gene14 0.2797161 -0.12219139 -0.12034622 -0.025956127 -0.33551757 -0.46422864

for PCs across diseses transpose your matrix:

df2 <- t(df)
colnames(df2) <- df$Disease
df3 <- df2[-1,]
prcomp(df3,scale=TRUE)
Standard deviations (1, .., p=6):
[1] 1.3881805 1.3086775 1.0332001 0.8999575 0.5686278 0.3994426

Rotation (n x k) = (6 x 6):
         PC1        PC2         PC3        PC4        PC5         PC6
1 -0.1761229 -0.4084611  0.42521515 -0.7364243  0.2424742 -0.14218944
1  0.5843042  0.3073572 -0.04722848 -0.2941286  0.4419985  0.52916471
0 -0.5145338  0.4178422  0.04746496 -0.3238892 -0.4356411  0.51353922
0 -0.4042091  0.3652627  0.47093069  0.3634759  0.5908253 -0.01526729
2 -0.3531166  0.1750554 -0.74358399 -0.2618183  0.3973516 -0.25555829
2  0.2734007  0.6324854  0.20003938 -0.2561246 -0.2215796 -0.60868811