0
votes

I have a set of 520 influenza sequences for which I have already done multiple sequence alignment, and computed the pairwise identity matrix. If I'd like to add in another sequence, I have to re-align everything, and recompute the entire PWI matrix. Is there any program I can use to "append" this other sequence to the alignment, and only compute the PWI w.r.t. every other sequence?

A simple example would be as follows. I have a 2x2 alignment, with the following scores.

     SeqA SeqB
SeqA 1.00 0.98
SeqB 0.98 1.00

Without re-running a full alignment, but only running "SeqC" against all the other sequences, I'd like to get the following matrix:

     SeqA SeqB SeqC
SeqA 1.00 0.98 0.99
SeqB 0.98 1.00 0.97
SeqC 0.99 0.97 1.00

I am using the BioPython package, and Python is my preferred language, but I'm okay with Java if need be too.

[I'll disclaim here that I'm cross-posting from BioStars, just in case there's experts here that aren't on BioStars. The BioStars post is: http://www.biostars.org/p/77607/, but it has the exact same content.]

1
can you maybe link to the BioStarts question? - PhiS

1 Answers

1
votes

If your primary issue in the time it takes to re-run alignment (re-computing the PWI matrix should be computationally cheap), then MUSCLE has the capability to do what you're looking for, a procedure commonly termed "profile-profile alignment".

Profile-profile alignment

When passing the -profile flag, the alignments will be "re-aligned to each other, keeping input columns intact and inserting columns of gaps where needed.":

If you have two existing alignments of related sequences you can use the –profile option of MUSCLE to align those two sequences. Typical usage is:

   muscle -profile -in1 one.afa -in2 two.afa -out both.afa

Implementing in Biopython

Biopython has a wrapper around MUSCLE, but I find it easier to just use subprocess to call MUSCLE, then parse the results back into a MultipleSeqAlignment:

# Do profile-profile alignment (one sequence to many aligned)
seq_fn = "influenza_seq.fasta"
aligned_fn = "520_influenza_seqs.afasta"
cmd = ['muscle', '-clwstrict', '-profile', '-in1', seq_fn, '-in2', aligned_fn]
aligner = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = aligner.communicate()

# Get resulting alignment (MultipleSeqAlignment)
alignment =  AlignIO.read(StringIO(stdout), "clustal",
                          alphabet=Alphabet.ProteinAlphabet())