I see!: July 2013

2013-07-29

Use cut to extract a certain text range from a one-line DNA sequence

I wanted to get the characters at position 27947 to 28203 from a text file containing just one DNA sequence as a very long line. This is the cut command that worked:

cut -c 27947-28203 file.txt

2013-07-19

How to automatically create Venn diagram for five sets

Wikipedia has a very nice Venn diagram for five sets. I use this svg file to automatically create a Venn diagram displaying the number of intersecting elements of five sets.

Number of intersecting elements

For this we need the number of intersecting elements for all possible set combinations, in my example it looks like this:

numbers.txt

A 13644

B 14729

C 14690

D 13725

E 13742

AB 3689

AC 3616

AD 3523

AE 3496

BC 14281

BD 12852

BE 12694

CD 13215

CE 13060

DE 13563

ABC 3609

ABD 3507

ABE 3480

ACD 3513

ACE 3487

ADE 3496

BCD 12849

BCE 12694

BDE 12694

CDE 13056

ABCD 3506

ABCE 3480

ABDE 3480

ACDE 3487

BCDE 12694

ABCDE 3480

To get these numbers I used a small R script and formatted the output with a little bit of TextWrangler and grep to look like in the file numbers.txt

Here is the short R script:

A <- read.delim('Info1.csv')$name

B <- read.delim('Info2.csv')$name

C <- read.delim('Info3.csv')$name

D <- read.delim('Info4.csv')$name

E <- read.delim('Info5.csv')$name

# all sets

print(paste("A",length(A)))

print(paste("B",length(B)))

print(paste("C",length(C)))

print(paste("D",length(D)))

print(paste("E",length(E)))

# all combinations of two sets

print(paste("A, B", length(intersect(A, B))))

print(paste("A, C", length(intersect(A, C))))

print(paste("A, D", length(intersect(A, D))))

print(paste("A, E", length(intersect(A, E))))

print(paste("B, C", length(intersect(B, C))))

print(paste("B, D", length(intersect(B, D))))

print(paste("B, E", length(intersect(B, E))))

print(paste("C, D", length(intersect(C, D))))

print(paste("C, E", length(intersect(C, E))))

print(paste("D, E", length(intersect(D, E))))

# all combinations of three sets

print(paste("A, B, C", length(intersect(A, intersect(B, C)))))

print(paste("A, B, D", length(intersect(A, intersect(B, D)))))

print(paste("A, B, E", length(intersect(A, intersect(B, E)))))

print(paste("A, C, D", length(intersect(A, intersect(C, D)))))

print(paste("A, C, E", length(intersect(A, intersect(C, E)))))

print(paste("A, D, E", length(intersect(A, intersect(D, E)))))

print(paste("B, C, D", length(intersect(B, intersect(C, D)))))

print(paste("B, C, E", length(intersect(B, intersect(C, E)))))

print(paste("B, D, E", length(intersect(B, intersect(D, E)))))

print(paste("C, D, E", length(intersect(C, intersect(D, E)))))

# all combinations of four sets

print(paste("A, B, C, D", length(intersect(A, intersect(B, intersect(C, D))))))

print(paste("A, B, C, E", length(intersect(A, intersect(B, intersect(C, E))))))

print(paste("A, B, D, E", length(intersect(A, intersect(B, intersect(D, E))))))

print(paste("A, C, D, E", length(intersect(A, intersect(C, intersect(D, E))))))

print(paste("B, C, D, E", length(intersect(B, intersect(C, intersect(D, E))))))

# combination of five sets

print(paste("ABCDE", length(intersect(A, intersect(B, intersect(C, intersect(D, E)))))))

Getting these numbers into the diagram

Since the svg file only consists of 51 lines, I display it here:

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">

<title>Radially-symmetrical Five-set Venn Diagram</title>

<desc>Devised by Branko Gruenbaum and rendered by CMG Lee.</desc>

<defs>

</g>

</defs>

<text x="0" y="0" dy="0.7ex">ABCDE</text>

</g>

</svg>

You can see, that the names of the different overlapping fields are annotated as "A", "B", "AB", and so forth. Hence, I can just replace them with sed, to display my own data!

This I do with the following bash/sed command:

while read LINE

NAME=$( echo $LINE | awk '{print $1 }')

NUMB=$( echo $LINE | awk '{print $2 }')

echo $NAME $NUMB

sed -i tmp "s/\>${NAME}\</\>${NUMB}\</g" venn.svg

echo "sed -i tmp "s/\>\${NAME}\</\>\${NUMB}\</g" venn.svg"

done < 'numbers.txt'

The altered Venn diagram

The output is indeed exactly what I wanted:

Of course, I still need to adapt the font size, but thanks to sed, this won't be a big issue.

How to convert fasta file to uppercase/lowercase, without altering the annotation line

This awk-oneliner converts all lines which do not contain a '>' in the beginning to upper case:

$ awk '{ if ($0 !~ />/) {print toupper($0)} else {print $0} }' name.fasta

How to see stdout output in the terminal and pipe it into a file

Since I always have to look up how to print the stdout output to the screen AND to a file at the same time, I put this small command here:

command 2>&1 | tee outputfile.txt

How to convert .maf alignment format to a RepeatMasker-output-file-like format

The following python script takes a .maf alignment file, which looks for example like this:

# lastz.v1.02.00 --format=maf --output=NM_144963.2-3.maf
#
# hsp_threshold = 3000
# gapped_threshold = 3000
# x_drop = 910
# y_drop = 9400
# gap_open_penalty = 400
# gap_extend_penalty = 30
# A C G T
# A 91 -114 -31 -123
# C -114 100 -125 -31
# G -31 -125 100 -114
# T -123 -31 -114 91
a score=11361
s target 0 133 + 5344 GGCGGGAGGCTTAGGGTGGGTCGCGGTCGCGGCCCCCGCCGCAGGCCCTGGCGGCGGCATCGTGAACATAGACGTGGAGTTCCACATCCGGCACAACTACCCCTGGAACAAGTTGCCGGCCAACGTGAGGCAG
s query 0 133 + 46809 GGCGGGAGGCGTAGTGTGGGTCGCGGTGGCGGCCCCGGCCGCCGGCCCTGGCCGCGGCATCATGAACATAGACGTGGAGTTCCACATCCGGCACAACTACCCCTGGAACAAGTTGCCGGCCAACGTGAGACAG

a score=7552
s target 207 151 + 5344 tacaggaagtattggtcttgctgttcagagtggcagaggagaaaga--------aaatcccagcC---GTCTTGGAAATTCACAGAGAGAATATGAAAATCAGGTTGTCCTGTACAGTATCTGCAATCAGTTACGATACAGAAATAACTTAAACATGTCAAG
s query 5371 162 + 46809 tatgggaaaaaatagtactattgttcagtttggaagttgtttaaaatctttttgaaatttctcctagAGTCTTGGAAATTCACAGAGAGAATATGAAAAGCAGGTTGTCCTGTACAGTATCCGCAATCAGTTACGATATAGAAATAACTTAGgtaagtagag

a score=13285
s target 344 152 + 5344 TTAAACATGTCAAGAAAGATGAACGCAGATACTATGAGGAACTGCTAAACTACAGCCGAGATCATCTCATGCTGTACCCTTACCGTCTATTGGATATTATGGTGAAGAGCTTGAGCATAACACCATTTTCATATTACACTGGGATTATGGAG
s query 6505 152 + 46809 TTAAACATGTCAAGAAAGATGAACGCAGATACTATGAGGAACTGCTAAAGTACAGCCGAGATCATCTCATGCTGTACCCTTACCATCTATCGGATATTATGGTGAAAGGCTTGAGGATAACACCATTTTCATATTATACTGGGATTATGGAG

a score=6190
s target 536 68 + 5344 GTCTAAGGCTTCTTGGCATAGGAAGAAACCAGTATATTGATCTTATGATTCAGTGTAGATCATCAAAA
s query 9383 68 + 46809 GTCTAAGGCTTCTTGGCATAGGAAGAAACCAGTATATTGATCTTATGAATCAGTGTAGATCATCAAAA

and converts it into a RepeatMasker-output-like format that looks like this:
SW subst del_in_retro insert_in_retro query qstart qend qleft compl match match_class mstart mend mleft id
11361 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 0 133 5211 +,+ hg19_refGene_NM_144963 nA 0 133 46676 1
7552 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 207 358 4986 +,+ hg19_refGene_NM_144963 nA 5371 5533 41276 2
13285 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 344 496 4848 +,+ hg19_refGene_NM_144963 nA 6505 6657 40152 3
6190 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 536 604 4740 +,+ hg19_refGene_NM_144963 nA 9383 9451 37358 4

I save also the output that will not go into the new file into variables, in case I maybe want to include it as well.

import sys
INFILENAME=sys.argv[1]

WITH_HEADER=False
try:
WITH_HEADER=sys.argv[2]
except:
pass

INFILE = open(INFILENAME, 'r')

ALNS=[]
i=0
for LINE in INFILE.readlines():
if 'a score=' in LINE:
SCORE=LINE.split('=')[1].strip()
i+=1
QUERY_PASSED=False
MATCH_PASSED=False
elif LINE.startswith('s') and not QUERY_PASSED:
MUERY, MSTART, MSIZE, MSTRAND, MSRCSIZE, MSEQUENCE = LINE.split()[1:]
MEND = str(int(MSTART) + int(MSIZE))
MLEFT = str(int(MSRCSIZE) - int(MEND))
MUERY_PASSED = True
elif LINE.startswith('s') and QUERY_PASSED:
QATCH, QSTART, QSIZE, QSTRAND, QSRCSIZE, QSEQUENCE = LINE.split()[1:]
QEND = str(int(QSTART) + int(QSIZE))
QLEFT = str(int(QSRCSIZE) - int(QEND))
QATCH_PASSED=True
ALNS.append((SCORE, 'nA', 'nA', 'nA', QUERY, QSTART, QEND, QLEFT, MSTRAND, MATCH, 'nA', MSTART, MEND, MLEFT, str(i)))

INFILE.close()

if WITH_HEADER:
print("SW subst del_in_retro insert_in_retro query qstart qend qleft compl match match_class mstart mend mleft id")
for LINE in ALNS:
print('\t'.join(LINE))