2013-07-19

How to convert .maf alignment format to a RepeatMasker-output-file-like format

The following python script takes a .maf alignment file, which looks for example like this:


# lastz.v1.02.00 --format=maf --output=NM_144963.2-3.maf 
#
# hsp_threshold      = 3000
# gapped_threshold   = 3000
# x_drop             = 910
# y_drop             = 9400
# gap_open_penalty   = 400
# gap_extend_penalty = 30
#        A    C    G    T
#   A   91 -114  -31 -123
#   C -114  100 -125  -31
#   G  -31 -125  100 -114
#   T -123  -31 -114   91
a score=11361
s target 0 133 +  5344 GGCGGGAGGCTTAGGGTGGGTCGCGGTCGCGGCCCCCGCCGCAGGCCCTGGCGGCGGCATCGTGAACATAGACGTGGAGTTCCACATCCGGCACAACTACCCCTGGAACAAGTTGCCGGCCAACGTGAGGCAG
s query           0 133 + 46809 GGCGGGAGGCGTAGTGTGGGTCGCGGTGGCGGCCCCGGCCGCCGGCCCTGGCCGCGGCATCATGAACATAGACGTGGAGTTCCACATCCGGCACAACTACCCCTGGAACAAGTTGCCGGCCAACGTGAGACAG

a score=7552
s target  207 151 +  5344 tacaggaagtattggtcttgctgttcagagtggcagaggagaaaga--------aaatcccagcC---GTCTTGGAAATTCACAGAGAGAATATGAAAATCAGGTTGTCCTGTACAGTATCTGCAATCAGTTACGATACAGAAATAACTTAAACATGTCAAG
s query           5371 162 + 46809 tatgggaaaaaatagtactattgttcagtttggaagttgtttaaaatctttttgaaatttctcctagAGTCTTGGAAATTCACAGAGAGAATATGAAAAGCAGGTTGTCCTGTACAGTATCCGCAATCAGTTACGATATAGAAATAACTTAGgtaagtagag

a score=13285
s target  344 152 +  5344 TTAAACATGTCAAGAAAGATGAACGCAGATACTATGAGGAACTGCTAAACTACAGCCGAGATCATCTCATGCTGTACCCTTACCGTCTATTGGATATTATGGTGAAGAGCTTGAGCATAACACCATTTTCATATTACACTGGGATTATGGAG
s query           6505 152 + 46809 TTAAACATGTCAAGAAAGATGAACGCAGATACTATGAGGAACTGCTAAAGTACAGCCGAGATCATCTCATGCTGTACCCTTACCATCTATCGGATATTATGGTGAAAGGCTTGAGGATAACACCATTTTCATATTATACTGGGATTATGGAG

a score=6190
s target  536 68 +  5344 GTCTAAGGCTTCTTGGCATAGGAAGAAACCAGTATATTGATCTTATGATTCAGTGTAGATCATCAAAA
s query           9383 68 + 46809 GTCTAAGGCTTCTTGGCATAGGAAGAAACCAGTATATTGATCTTATGAATCAGTGTAGATCATCAAAA


and converts it into a RepeatMasker-output-like format that looks like this:
SW      subst   del_in_retro    insert_in_retro query   qstart  qend    qleft   compl   match   match_class     mstart  mend    mleft   id
11361 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 0 133 5211 +,+ hg19_refGene_NM_144963 nA 0 133 46676 1
7552 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 207 358 4986 +,+ hg19_refGene_NM_144963 nA 5371 5533 41276 2
13285 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 344 496 4848 +,+ hg19_refGene_NM_144963 nA 6505 6657 40152 3
6190 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 536 604 4740 +,+ hg19_refGene_NM_144963 nA 9383 9451 37358 4

I save also the output that will not go into the new file into variables, in case I maybe want to include it as well.




import sys
INFILENAME=sys.argv[1]

WITH_HEADER=False
try:
WITH_HEADER=sys.argv[2]
except:
pass

INFILE = open(INFILENAME, 'r')

ALNS=[]
i=0
for LINE in INFILE.readlines():
if 'a score=' in LINE:
SCORE=LINE.split('=')[1].strip()
i+=1
QUERY_PASSED=False
MATCH_PASSED=False
elif LINE.startswith('s') and not QUERY_PASSED:
   MUERY, MSTART, MSIZE, MSTRAND, MSRCSIZE, MSEQUENCE = LINE.split()[1:]
   MEND = str(int(MSTART) + int(MSIZE))
   MLEFT = str(int(MSRCSIZE) - int(MEND))
   MUERY_PASSED = True
elif LINE.startswith('s') and QUERY_PASSED:
   QATCH, QSTART, QSIZE, QSTRAND, QSRCSIZE, QSEQUENCE = LINE.split()[1:]
   QEND = str(int(QSTART) + int(QSIZE))
   QLEFT = str(int(QSRCSIZE) - int(QEND))
   QATCH_PASSED=True
   ALNS.append((SCORE, 'nA', 'nA', 'nA', QUERY, QSTART, QEND, QLEFT, MSTRAND, MATCH, 'nA', MSTART, MEND, MLEFT, str(i)))

INFILE.close()

if WITH_HEADER:
print("SW      subst   del_in_retro    insert_in_retro query   qstart  qend    qleft   compl   match   match_class     mstart  mend    mleft   id")    
for LINE in ALNS:
print('\t'.join(LINE))
 

No comments:

Post a Comment