1
votes

I have many amino acids sequences like in a fasta format that I did multiple sequence alignment. I was trying to plot something like binary code as heatmap. If it had a change it would be red, if it did not change would be yellow, for example.

I came across to msaplot from ggtreepackage. I also checked ggmsa package for that. But so far, I did not get what I wanted. So basically I wanted to:

  1. change the multiple sequence alignment to a binary matrix (if the amino differs from the reference sequence, plot x, if not y)

  2. plot as a heatmap

A multiple sequence alignment is something like this for those who don't know.enter image description here

I know that I should provide some type of data example but I am not sure how to create an example of multiple sequence alignment

if you install ggmsa you can have an example of the data and plot in r using:

protein_sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")
ggmsa(protein_sequences, start = 265, end = 300) 
1
i am not so familiar with the files for sequence alignment, the plot should be pretty easy. You wanna provide an example, for lets say a 3 protein, 20 AA ? - StupidWolf
or you ask your question in bioinformatics.stackexchange.com/users maybe there are experts on that - StupidWolf
@StupidWolf I added one script that gives an example of the plot that is in the package ggmsa - rodrigarco
what is a reference sequence? - StupidWolf
Is a standard sequence where the other will be compared. So the first sequence is complete and doens't have "-". The rest of the sequences are aligned based on that, example: ref - ATKV , sequence 1 - A-KT . So, the A and K position is conserved compared to the reference, the "-" shows that is a deletion and the last amino acid changed from V to T. - rodrigarco

1 Answers

2
votes

We read in the alignment:

library(Biostrings)
library(ggmsa)
protein_sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")
aln = readAAMultipleAlignment(protein_sequences)
ggmsa(protein_sequences, start = 265, end = 300) 

enter image description here

Set the reference as the 1st sequence, some Rattus, you can also use the consensus with consensusString() :

aln = unmasked(aln)
names(aln)[1]
[1] "PH4H_Rattus_norvegicus"

ref = aln[1]

Here we iterate through the sequence and make the binary for where the sequences are the same as the reference:

bm = sapply(1:length(aln),function(i){
as.numeric(as.matrix(aln[i])==as.matrix(ref))
})

bm = t(bm)
rownames(bm) = names(aln)

The plot you see above has sequences reversed, so to visualize the same thing we reverse it, and also subset on 265 - 300:

library(pheatmap)
pheatmap(bm[nrow(bm):1,265:300],cluster_rows=FALSE,cluster_cols=FALSE)

enter image description here

The last row, is Rattus, the reference, and anything similar to that is read, as you can see in the alignment above, last 4 sequences are identical.