Could somebody please suggest how I would go about running a principal component analysis (PCA) on the gene data set I have?
I have a table containing 15 columns. The first one is disease group, where 0 is control, 1 is Ulcerative Colitis and 2 stands for Crohn’s.
The remaining 14 columns correspond to 14 different genes.
I would like to plot PC1 vs PC2 following PCA (via prcomp) to show whether any clustering or separation between the three groups occurs based on the gene expression data, with each axis label showing the proportion of variance explained. However, I am struggling to know where to start: I cannot convert column 1 to row names via row.names=1, because R doesn’t allow duplicate row names in a data frame.
Converting to a matrix and using the code below does not work:
mockdata1 <- as.matrix(mockdata)
rownames(mockdata1) <- mockdata1[,1]
mockdata1[,1] <- NULL
or
mockdata2 <- mockdata1[, -1]
In previous examples I have been able to compute the PCA, plot PC1 vs PC2 and colour the points by group after reading the data in with row.names=1, but I am not sure how to overcome this initial problem, as I can't use that approach here.
I have included my data below via dput(head(mockdata)):
structure(list(Disease = c(1L, 1L, 0L, 0L, 2L, 2L), Gene1 = c(9104.774619,
35924.12358, 6.780294688, 1284.690716, 69.50341155, 3935.107345
), Gene2 = c(5224.114486, 35625.73119, 18.35291351, 511.9272679,
186.7270146, 47611.65544), Gene3 = c(1472.348466, 137571.5525,
20.78531289, 3019.140256, 146.9615338, 108935.1303), Gene4 = c(2487.124686,
147604.774, 3.574347972, 1371.576262, 210.6773417, 82831.97458
), Gene5 = c(1872.328747, 235675.6461, 9.834667594, 583.1631957,
120.6931223, 75874.49936), Gene6 = c(1675.724728, 35931.1852,
9.91026361, 1634.038443, 58.04818134, 23502.78972), Gene7 = c(3775.885073,
169672.9921, 5.41305941, 929.2125312, 97.72621248, 46023.7009
), Gene8 = c(5015.202216, 137455.0032, 2.995124554, 1113.882634,
83.17636201, 14048.19237), Gene9 = c(883.5716868, 45920.44167,
6.399646876, 892.313155, 117.1104906, 10825.47974), Gene10 = c(1607.790858,
146627.0588, 1.967559425, 1237.299298, 90.8941744, 32747.04713
), Gene11 = c(2345.478241, 91047.57303, 12.33867961, 663.576224,
384.5839119, 6692.728154), Gene12 = c(2772.362496, 15511.96753,
15.64843017, 4143.085461, 169.545757, 22484.03574), Gene13 = c(4131.51741,
48601.7059, 21.66175797, 2250.0628, 316.0677196, 16612.6508),
Gene14 = c(1252.440598, 54794.36695, 2.925615978, 708.0342528,
211.822519, 14021.28425)), row.names = c(NA, 6L), class = "data.frame")
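
One possible way around the row-name problem, given as a minimal sketch rather than a definitive answer, is to leave the row names alone and keep the Disease column as a separate grouping factor that is only used for colouring. The example below rebuilds a small stand-in for mockdata from the first three gene columns of the dput output above (values rounded, other genes omitted for brevity). The log2(x + 1) transform and scale. = TRUE are assumptions, chosen because the values span several orders of magnitude, and the names group, genes, pca and var_pct are purely illustrative.

```r
# Stand-in data with the same layout as the question: Disease plus gene columns.
# Values are rounded from the dput output; only three genes are kept for brevity.
mockdata <- data.frame(
  Disease = c(1L, 1L, 0L, 0L, 2L, 2L),
  Gene1 = c(9104.77, 35924.12, 6.78, 1284.69, 69.50, 3935.11),
  Gene2 = c(5224.11, 35625.73, 18.35, 511.93, 186.73, 47611.66),
  Gene3 = c(1472.35, 137571.55, 20.79, 3019.14, 146.96, 108935.13)
)

# Keep the group labels in their own factor instead of using them as row names.
group <- factor(mockdata$Disease, levels = c(0, 1, 2),
                labels = c("Control", "Ulcerative Colitis", "Crohn's"))

# Drop the Disease column so that only the gene columns enter the PCA.
genes <- mockdata[, -1]

# Assumption: log2(x + 1) transform, then centre and scale each gene.
pca <- prcomp(log2(genes + 1), center = TRUE, scale. = TRUE)

# Percentage of variance explained by each component.
var_pct <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

# PC1 vs PC2, coloured by disease group, with variance shown on the axis labels.
plot(pca$x[, 1], pca$x[, 2], col = as.integer(group), pch = 19,
     xlab = paste0("PC1 (", var_pct[1], "%)"),
     ylab = paste0("PC2 (", var_pct[2], "%)"))
legend("topright", legend = levels(group),
       col = seq_along(levels(group)), pch = 19)
```

With the full table, the same steps should apply unchanged, since mockdata[, -1] would then keep all 14 gene columns. Whether to log-transform or scale depends on how the expression values were generated, so those two choices are worth checking against the data.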