I see!: Create a Multiple Sequence Alignment with masked parts and replace masked parts by propper sequence

The reason why I wrote the following Script is that I wanted to align a couple of sequences which contained repeats. Not surprisingly the alignment muscle produced did not look very nice, due to those repeats.

Unfortunately, one cannot tell muscle to ignore all small caps letters (afaik), hence I replaced all small caps by 'N' s. Making again an alignment produced a much better result. However, my sequences still contain the 'N's, of course. This is where the small Script kicks in: It replaces all N's with the character that is supposed to be found at this position.

Here is the code of the Script:

#! /usr/bin/python

"""
Script to "unmask" an aligned sequence.

Takes an alignment file of masked sequences and a file with the unmasked sequences
and returns an alignment file with unmasked sequences.

Written by: Tabea Hoehmann
"""

import sys
from textwrap import wrap

def read_fasta_file(fasta_filename):
"""
Read a fasta file and save the sequences to a dictionary.
Returns a dictionary with the sequence id as key and the
sequence as value.
"""
seq={}
fasta_filehandle = open(fasta_filename, 'r')
i=0
for line in fasta_filehandle.readlines():
if '>' in line:
name = line.strip().replace('>','').upper() ### !!!! Only the first element !!!!
seq[name] = ""
else:
seq[name] += line.strip()
if not set(line.strip()).issubset(set('actgkrynACTGKRYN-')):
print("Unknown character(s) in line %s of sequence file %s: %s\nScript proceeds as if nothing happened..."%(i, fasta_filename, ", ".join(set(line)-set('actgkrynACTGKRYN-'))))
i+=1
fasta_filehandle.close()
return(seq)

# Read fasta files
try:
masked_aln_fasta = read_fasta_file(sys.argv[1])
unmasked_fasta = read_fasta_file(sys.argv[2])
except IndexError:
print("Usage: unmask_fasta_aln.py file1 file 2")
print("file1: The alignment file that is masked by N's")
print("file1: The fasta file that is not masked")
sys.exit()

# New dictionary to store edited sequences
new_unmasked_aln_fasta = {}

# Iterate over every sequence, replace all N's by nucleotides
for masked_id, masked_seq in masked_aln_fasta.items():
unmasked_seq = unmasked_fasta[masked_id]
new_unmasked_aln_seq = ""
nt_pos = 0
for nt in masked_seq:
if nt not in 'nN':
new_unmasked_aln_seq += nt
else:
new_unmasked_aln_seq += unmasked_seq[nt_pos]
if nt != '-':
nt_pos+=1
new_unmasked_aln_fasta[masked_id] = new_unmasked_aln_seq

# Print result to stdout
for id, seq in new_unmasked_aln_fasta.items():
print('>%s\n%s'%(id, '\n'.join(wrap(seq, 100, break_on_hyphens=False))))

I see!

2013-09-05

Create a Multiple Sequence Alignment with masked parts and replace masked parts by propper sequence

No comments:

Post a Comment

Labels