I see!: September 2013

Since I tend to forget how easy it is to install LaTeX packages, here is a very short how-to:

Go to http://www.ctan.org/tex-archive/install/macros/latex/contrib and find the package you want
download the .tds.zip file
change to the folder were your downloaded file was saved to
unzip it to the directory were your texmf is stored, for me (Ubuntu) i.e.: /home/username/texmf/
update tex, in the terminal, type: texhash ~/texmf/
Done!

Or, in the terminal with the glossaries package as an example:

cd ~/Downloads/

wget http://www.ctan.org/tex-archive/install/macros/latex/contrib/glossaries.tds.zip

unzip ~/Downloads/glossaries.tds.zip -d ~/texmf/

texhash ~/texmf/

Note: This way, the new package is only installed for you, not for all users!

The reason why I wrote the following Script is that I wanted to align a couple of sequences which contained repeats. Not surprisingly the alignment muscle produced did not look very nice, due to those repeats.

Unfortunately, one cannot tell muscle to ignore all small caps letters (afaik), hence I replaced all small caps by 'N' s. Making again an alignment produced a much better result. However, my sequences still contain the 'N's, of course. This is where the small Script kicks in: It replaces all N's with the character that is supposed to be found at this position.

Here is the code of the Script:

#! /usr/bin/python

"""
Script to "unmask" an aligned sequence.

Takes an alignment file of masked sequences and a file with the unmasked sequences
and returns an alignment file with unmasked sequences.

Written by: Tabea Hoehmann
"""

import sys
from textwrap import wrap

def read_fasta_file(fasta_filename):
"""
Read a fasta file and save the sequences to a dictionary.
Returns a dictionary with the sequence id as key and the
sequence as value.
"""
seq={}
fasta_filehandle = open(fasta_filename, 'r')
i=0
for line in fasta_filehandle.readlines():
if '>' in line:
name = line.strip().replace('>','').upper() ### !!!! Only the first element !!!!
seq[name] = ""
else:
seq[name] += line.strip()
if not set(line.strip()).issubset(set('actgkrynACTGKRYN-')):
print("Unknown character(s) in line %s of sequence file %s: %s\nScript proceeds as if nothing happened..."%(i, fasta_filename, ", ".join(set(line)-set('actgkrynACTGKRYN-'))))
i+=1
fasta_filehandle.close()
return(seq)

# Read fasta files
try:
masked_aln_fasta = read_fasta_file(sys.argv[1])
unmasked_fasta = read_fasta_file(sys.argv[2])
except IndexError:
print("Usage: unmask_fasta_aln.py file1 file 2")
print("file1: The alignment file that is masked by N's")
print("file1: The fasta file that is not masked")
sys.exit()

# New dictionary to store edited sequences
new_unmasked_aln_fasta = {}

# Iterate over every sequence, replace all N's by nucleotides
for masked_id, masked_seq in masked_aln_fasta.items():
unmasked_seq = unmasked_fasta[masked_id]
new_unmasked_aln_seq = ""
nt_pos = 0
for nt in masked_seq:
if nt not in 'nN':
new_unmasked_aln_seq += nt
else:
new_unmasked_aln_seq += unmasked_seq[nt_pos]
if nt != '-':
nt_pos+=1
new_unmasked_aln_fasta[masked_id] = new_unmasked_aln_seq

# Print result to stdout
for id, seq in new_unmasked_aln_fasta.items():
print('>%s\n%s'%(id, '\n'.join(wrap(seq, 100, break_on_hyphens=False))))

I see!

2013-09-10

How to install LaTeX packages

2013-09-05

Create a Multiple Sequence Alignment with masked parts and replace masked parts by propper sequence

Labels