2
votes

I'm trying to make a gene expression profile plot in R. My input data is a data frame where column 1 has gene names and columns 2:18 are multiple cancer types. Here is a small subset of the data.

[image: sample of the input data frame]

What I want is a graph that has samples on the x-axis and, on the y-axis, an expression line for each gene, something that looks like this: [image: example expression profile plot with timepoints on the x-axis]

But instead of timepoints on the x-axis, it should have cancer types (the columns). So far I've tried ggplot() and geneprofiler(), but I failed over and over.

Any help will be greatly appreciated.

1
ggplot with geom_line() should work. What have you tried that failed? - Alexlok
Hey, my data was in wide format, so I was getting issues. Converting it to long format worked for me. - Umer Farooq

1 Answer

3
votes

Data Format

The current format of the data is referred to as wide format, but ggplot2 expects long-format data. The tidyr package (part of the tidyverse) has functions for converting between wide and long formats. In this case, you want the function tidyr::pivot_longer. For example, if you have the data in a data.frame (or tibble) called df_gene_expr, the pivot would go something like

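library(tidyverse)

df_gene_expr %>%
  # wide -> long: one row per (gene, cancer type) pair
  pivot_longer(cols=2:18, names_to="cancer_type", values_to="gene_expr") %>%
  # keep a single gene; ID is the gene-name column
  filter(ID == "ABCA8") %>%
  ggplot(aes(x=cancer_type, y=gene_expr)) +
  geom_point()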

where we single out the one gene "ABCA8". Change geom_point() to whatever geometry you actually want (perhaps geom_bar(stat='identity'), or equivalently geom_col()).
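To get one line per gene across cancer types, as the question asks, the same pivot can feed geom_line() with a group aesthetic. Below is a minimal sketch under stated assumptions: the data frame is fake, the gene names other than ABCA8 are hypothetical, and the four cancer type column names are placeholders standing in for the real columns 2:18.

```r
library(tidyr)
library(ggplot2)

# Fake wide-format data: one row per gene, one column per cancer type.
# GENE_B, GENE_C and the column labels are placeholders, not real data.
set.seed(1)
df_gene_expr <- data.frame(
  ID   = c("ABCA8", "GENE_B", "GENE_C"),
  BRCA = rnorm(3),
  LUAD = rnorm(3),
  COAD = rnorm(3),
  KIRC = rnorm(3)
)

# Wide -> long: every column except ID becomes a (cancer_type, gene_expr) pair.
df_long <- pivot_longer(df_gene_expr, cols = -ID,
                        names_to = "cancer_type", values_to = "gene_expr")

# One line per gene; group = ID tells ggplot2 which points to connect
# across the discrete x-axis.
p <- ggplot(df_long, aes(x = cancer_type, y = gene_expr,
                         group = ID, color = ID)) +
  geom_line() +
  geom_point()
p
```

Without the group aesthetic, geom_line() on a discrete x-axis treats each x value as its own group and draws nothing useful, which is a common stumbling block with this kind of plot.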


Discrete Trendline

I'm not sure that geom_smooth is entirely appropriate - it is designed with continuous-continuous data in mind. Instead, I'd recommend stat_summary.

There's a slight trick to this because of the discrete cancer_type on the x-axis. Namely, the cancer_type variable should be a factor, but we will use its underlying integer codes for the x-values in stat_summary. Otherwise, ggplot would complain that geom='line' doesn't make sense there.

Something along these lines:

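# df_long: the long-format data from pivot_longer(), with cancer_type as a factor
ggplot(df_long, aes(x=cancer_type, y=gene_expr)) +
  # dot-dash reference line at zero expression
  geom_hline(yintercept=0, linetype=4, color="red") +
  # one faint line per gene
  geom_line(aes(group=ID), size=0.5, alpha=0.3, color="black") +
  # summary line across all genes, drawn over the factor's integer codes
  # (in ggplot2 >= 3.4.0, linewidth replaces size for lines)
  stat_summary(aes(x=as.numeric(cancer_type)), fun=mean, geom='line',
               size=2, color='orange')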

Output from fake data: [image: per-gene expression lines with the summary line overlaid]

Technically, this same trick (aes(x=as.numeric(cancer_type))) could be applied equally well to geom_smooth, but I think it still makes more sense to use stat_summary, which lets one explicitly pick the statistic to be computed. For example, median instead of mean might be more appropriate in this context for the summary function.
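For completeness, here is a self-contained sketch of the median variant, including the factor conversion that the trick depends on. Everything about the data is an assumption made for illustration: the 20 gene IDs are hypothetical and typeA through typeD are placeholder cancer type names.

```r
library(tidyr)
library(ggplot2)

# Fake data, for illustration only: 20 hypothetical genes across
# 4 placeholder cancer types.
set.seed(42)
df_wide <- data.frame(
  ID    = paste0("gene", 1:20),
  typeA = rnorm(20),
  typeB = rnorm(20, mean = 0.5),
  typeC = rnorm(20, mean = -0.5),
  typeD = rnorm(20)
)
df_long <- pivot_longer(df_wide, cols = -ID,
                        names_to = "cancer_type", values_to = "gene_expr")

# pivot_longer() returns cancer_type as character; convert it to a factor
# so that as.numeric() yields the integer level codes (1, 2, 3, ...).
# Setting levels explicitly keeps the original column order on the x-axis.
df_long$cancer_type <- factor(df_long$cancer_type, levels = names(df_wide)[-1])

p <- ggplot(df_long, aes(x = cancer_type, y = gene_expr)) +
  # one faint line per gene
  geom_line(aes(group = ID), alpha = 0.3, color = "black") +
  # median across genes at each cancer type, connected as a single line
  stat_summary(aes(x = as.numeric(cancer_type)), fun = median,
               geom = "line", color = "orange")
p
```

Calling as.numeric() on a character column would instead produce NA values with a warning, which is why the factor conversion comes first.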