I'm trying to create a rule that assigns a specific color code to every unique string, for graphing in ggplot2 across different files. For example, suppose I have two tab-delimited files, file1.txt and file2.txt, that look like this:

file1.txt

Freq Seq    
90    AAGTGT
3     AAGTGG
3     AAGTCC
2     AATTTT
2     TTTTTT

file2.txt

Freq Seq
91    AAGTGT
4     AAGTGG
2     AAGTCC
2     CCCCCC
1     TTTTTT

The files above use a total of 6 different colors, one for each of the 6 different sequences (AAGTGT, AAGTGG, AAGTCC, CCCCCC, TTTTTT, AATTTT). Across my many files, I need ~3000 colors, for which I've created a palette (pal) using

pal<-c(randomColor(count=2951))

Is there a way to ensure that every sequence keeps the same string-to-hex-color pairing across all of my files (i.e. that every file containing the AAGTGT sequence shows it in the same hex color)? Note that not all 3000 colors are represented in each file.

Thanks!

There is no way anybody can distinguish 3000 colours. Even if not all 3000 colours are represented in one plot, you might end up with 10 colours that are nearly indistinguishable. I don't understand what you're trying to do. - Maurits Evers

1 Answer

Hope this helps!

library(ggplot2)
library(randomcoloR)

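# build a palette mapping from the 'Seq' values across all available data frames
set.seed(123)
all_seqs <- unique(c(as.character(df1$Seq), as.character(df2$Seq)))
# one color per unique sequence, rather than a hard-coded count
pal <- randomColor(count = length(all_seqs))
pal_seq_mapping <- data.frame(sequence = all_seqs, color = pal, stringsAsFactors = FALSE)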

#example plot on 'df1' dataframe
ggplot(df1, aes(x=Seq, y=Freq)) +
  geom_bar(stat="identity", fill=pal_seq_mapping[match(df1$Seq, pal_seq_mapping$sequence),"color"]) +
  theme_bw()

#example plot on 'df2' dataframe
ggplot(df2, aes(x=Seq, y=Freq)) +
  geom_bar(stat="identity", fill=pal_seq_mapping[match(df2$Seq, pal_seq_mapping$sequence),"color"]) +
  theme_bw()

Output plot:
Note that the same color is used for every Seq common to df1 and df2.

[Figure: output plot]
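
As an alternative to indexing the mapping with match(), the sequence-to-color pairs can be kept in a named character vector and handed to scale_fill_manual(), which looks up each fill value by name. This is only a sketch and not part of the answer above: the names seq_colors and plot_seq are made up for illustration, and df1/df2 are rebuilt inline from the sample files so the block runs on its own.

```r
library(ggplot2)
library(randomcoloR)

# Sample data mirroring file1.txt and file2.txt
df1 <- data.frame(Freq = c(90, 3, 3, 2, 2),
                  Seq = c("AAGTGT", "AAGTGG", "AAGTCC", "AATTTT", "TTTTTT"))
df2 <- data.frame(Freq = c(91, 4, 2, 2, 1),
                  Seq = c("AAGTGT", "AAGTGG", "AAGTCC", "CCCCCC", "TTTTTT"))

# One palette entry per unique sequence across all data frames, keyed by sequence
all_seqs <- unique(c(as.character(df1$Seq), as.character(df2$Seq)))
seq_colors <- setNames(randomColor(count = length(all_seqs)), all_seqs)

# Hypothetical helper: fill is mapped to Seq, and the manual scale
# resolves each sequence to its color by name in seq_colors
plot_seq <- function(df) {
  ggplot(df, aes(x = Seq, y = Freq, fill = Seq)) +
    geom_col() +
    scale_fill_manual(values = seq_colors) +
    theme_bw()
}

p1 <- plot_seq(df1)
p2 <- plot_seq(df2)
```

Because the lookup is by name, sequences that are absent from a given data frame are simply unused, so the same seq_colors vector can be shared by every plot.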

#sample data
> dput(df1)
structure(list(Freq = c(90L, 3L, 3L, 2L, 2L), Seq = structure(c(3L, 
2L, 1L, 4L, 5L), .Label = c("AAGTCC", "AAGTGG", "AAGTGT", "AATTTT", 
"TTTTTT"), class = "factor")), .Names = c("Freq", "Seq"), class = "data.frame", row.names = c(NA, 
-5L))
> dput(df2)
structure(list(Freq = c(91L, 4L, 2L, 2L, 1L), Seq = structure(c(3L, 
2L, 1L, 4L, 5L), .Label = c("AAGTCC", "AAGTGG", "AAGTGT", "CCCCCC", 
"TTTTTT"), class = "factor")), .Names = c("Freq", "Seq"), class = "data.frame", row.names = c(NA, 
-5L))
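
Since the question involves many files, one possible way to build the mapping is to read the Seq column from every file, take the unique values, and save the resulting named vector so later sessions reuse the same colors. The sketch below rests on assumptions: it writes the two example files to a temporary directory so it can run as-is, the file name seq_colors.rds is made up, and rainbow() from base R stands in for the randomColor() palette.

```r
# Write the two example files to a temporary directory so the sketch runs as-is
dir <- tempfile("seqfiles")
dir.create(dir)
writeLines(c("Freq\tSeq", "90\tAAGTGT", "3\tAAGTGG", "3\tAAGTCC",
             "2\tAATTTT", "2\tTTTTTT"),
           file.path(dir, "file1.txt"))
writeLines(c("Freq\tSeq", "91\tAAGTGT", "4\tAAGTGG", "2\tAAGTCC",
             "2\tCCCCCC", "1\tTTTTTT"),
           file.path(dir, "file2.txt"))

files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)

# Collect every sequence seen in any file; sorting makes the order reproducible
all_seqs <- sort(unique(unlist(lapply(files, function(f) {
  read.delim(f, stringsAsFactors = FALSE)$Seq
}))))

# Stand-in palette from base R (assumption); randomColor(count = length(all_seqs))
# could be used here instead
seq_colors <- setNames(grDevices::rainbow(length(all_seqs)), all_seqs)

# Save the mapping so later sessions can reload the same sequence-to-color pairs
map_file <- file.path(dir, "seq_colors.rds")
saveRDS(seq_colors, map_file)
seq_colors_reloaded <- readRDS(map_file)
```

Reloading the saved vector, rather than regenerating the palette, keeps each sequence's color fixed even if the random palette would differ between runs.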