If a BED file looks for example like this
chr1 11873 14361 AM992877 0 + 11873 14361 0 3 354,109,1141, 0,739,1347,
chr1 11873 14361 AM992881 0 + 11873 14361 0 3 354,127,959, 0,721,1529,
chr1 11873 14362 AM992878 0 + 11873 14362 0 5 354,52,436,299,404, 0,772,1347,1785,2085,
chr1 11873 14362 AM992879 0 + 11873 14362 0 3 354,109,1142, 0,739,1347,
chr1 11873 14409 AM992871 0 + 11873 14409 0 3 354,109,1189, 0,739,1347,
it contains one line for every mRNA (in this case), giving the mRNA start and end coordinates in the first three columns. However, the blocks' coordinates are given in the last two columns and if you use a tool such as, e.g., Bedtools, you can only find results for the whole length of the mRNA.
I wanted to get results based on the individual blocks of the mRNA.
To get these I wrote a script that turns the above BED file into this format:
chr1 11873 12227 AM992877 0 + 11873 14361 0 3 354,109,1141, 0,739,1347,
chr1 12612 12721 AM992877 0 + 11873 14361 0 3 354,109,1141, 0,739,1347,
chr1 13220 14361 AM992877 0 + 11873 14361 0 3 354,109,1141, 0,739,1347,
chr1 11873 12227 AM992881 0 + 11873 14361 0 3 354,127,959, 0,721,1529,
chr1 12594 12721 AM992881 0 + 11873 14361 0 3 354,127,959, 0,721,1529,
chr1 13402 14361 AM992881 0 + 11873 14361 0 3 354,127,959, 0,721,1529,
chr1 11873 12227 AM992878 0 + 11873 14362 0 5 354,52,436,299,404, 0,772,1347,1785,2085,
chr1 12645 12697 AM992878 0 + 11873 14362 0 5 354,52,436,299,404, 0,772,1347,1785,2085,
chr1 13220 13656 AM992878 0 + 11873 14362 0 5 354,52,436,299,404, 0,772,1347,1785,2085,
chr1 13658 13957 AM992878 0 + 11873 14362 0 5 354,52,436,299,404, 0,772,1347,1785,2085,
chr1 13958 14362 AM992878 0 + 11873 14362 0 5 354,52,436,299,404, 0,772,1347,1785,2085,
chr1 11873 12227 AM992879 0 + 11873 14362 0 3 354,109,1142, 0,739,1347,
chr1 12612 12721 AM992879 0 + 11873 14362 0 3 354,109,1142, 0,739,1347,
chr1 13220 14362 AM992879 0 + 11873 14362 0 3 354,109,1142, 0,739,1347,
chr1 11873 12227 AM992871 0 + 11873 14409 0 3 354,109,1189, 0,739,1347,
chr1 12612 12721 AM992871 0 + 11873 14409 0 3 354,109,1189, 0,739,1347,
chr1 13220 14409 AM992871 0 + 11873 14409 0 3 354,109,1189, 0,739,1347,
This is the script, which works nice, if the BED file has the exact same format as in the example:
import sys
infile = sys.argv[1]
infile_handle = open(infile)
for line in infile_handle.readlines():
chr, chr_start, chr_end, name, score, strand, thickStart, \
thickEnd, itemRGB, blockCount, blockSizes, blockStarts = \
line.split('\t')
chr_start = int(chr_start)
chr_end = int(chr_end)
for i, blockSize in enumerate(blockSizes.split(',')[:-1]):
blockSize = int(blockSize)
blockStart = int(blockStarts.split(',')[:-1][i])
s = "\t".join([chr, str(chr_start + blockStart), str(chr_start + blockStart + blockSize), name, score, strand, thickStart, thickEnd, itemRGB, blockCount, blockSizes, blockStarts])
print s,
No comments:
Post a Comment