I wanted to get the characters at position 27947 to 28203 from a text file containing just one DNA sequence as a very long line. This is the cut command that worked:
cut -c 27947-28203 file.txt
2013-07-19
How to automatically create Venn diagram for five sets
Wikipedia has a very nice Venn diagram for five sets. I use this svg file to automatically create a Venn diagram displaying the number of intersecting elements of five sets.
Number of intersecting elements
For this we need the number of intersecting elements for all possible set combinations, in my example it looks like this:
numbers.txt
A 13644
B 14729
C 14690
D 13725
E 13742
AB 3689
AC 3616
AD 3523
AE 3496
BC 14281
BD 12852
BE 12694
CD 13215
CE 13060
DE 13563
ABC 3609
ABD 3507
ABE 3480
ACD 3513
ACE 3487
ADE 3496
BCD 12849
BCE 12694
BDE 12694
CDE 13056
ABCD 3506
ABCE 3480
ABDE 3480
ACDE 3487
BCDE 12694
ABCDE 3480
To get these numbers I used a small R script and formatted the output with a little bit of TextWrangler and grep to look like in the file numbers.txt
Here is the short R script:
A <- read.delim('Info1.csv')$name
B <- read.delim('Info2.csv')$name
C <- read.delim('Info3.csv')$name
D <- read.delim('Info4.csv')$name
E <- read.delim('Info5.csv')$name
# all sets
print(paste("A",length(A)))
print(paste("B",length(B)))
print(paste("C",length(C)))
print(paste("D",length(D)))
print(paste("E",length(E)))
# all combinations of two sets
print(paste("A, B", length(intersect(A, B))))
print(paste("A, C", length(intersect(A, C))))
print(paste("A, D", length(intersect(A, D))))
print(paste("A, E", length(intersect(A, E))))
print(paste("B, C", length(intersect(B, C))))
print(paste("B, D", length(intersect(B, D))))
print(paste("B, E", length(intersect(B, E))))
print(paste("C, D", length(intersect(C, D))))
print(paste("C, E", length(intersect(C, E))))
print(paste("D, E", length(intersect(D, E))))
# all combinations of three sets
print(paste("A, B, C", length(intersect(A, intersect(B, C)))))
print(paste("A, B, D", length(intersect(A, intersect(B, D)))))
print(paste("A, B, E", length(intersect(A, intersect(B, E)))))
print(paste("A, C, D", length(intersect(A, intersect(C, D)))))
print(paste("A, C, E", length(intersect(A, intersect(C, E)))))
print(paste("A, D, E", length(intersect(A, intersect(D, E)))))
print(paste("B, C, D", length(intersect(B, intersect(C, D)))))
print(paste("B, C, E", length(intersect(B, intersect(C, E)))))
print(paste("B, D, E", length(intersect(B, intersect(D, E)))))
print(paste("C, D, E", length(intersect(C, intersect(D, E)))))
# all combinations of four sets
print(paste("A, B, C, D", length(intersect(A, intersect(B, intersect(C, D))))))
print(paste("A, B, C, E", length(intersect(A, intersect(B, intersect(C, E))))))
print(paste("A, B, D, E", length(intersect(A, intersect(B, intersect(D, E))))))
print(paste("A, C, D, E", length(intersect(A, intersect(C, intersect(D, E))))))
print(paste("B, C, D, E", length(intersect(B, intersect(C, intersect(D, E))))))
# combination of five sets
print(paste("ABCDE", length(intersect(A, intersect(B, intersect(C, intersect(D, E)))))))
Getting these numbers into the diagram
Since the svg file only consists of 51 lines, I display it here:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="746" height="742" viewBox="-362 -388 746 742">
<title>Radially-symmetrical Five-set Venn Diagram</title>
<desc>Devised by Branko Gruenbaum and rendered by CMG Lee.</desc>
<defs>
<ellipse id="ellipse" cx="36" cy="-56" rx="160" ry="320" />
<g id="ellipses">
<use xlink:href="#ellipse" fill="#0000ff" />
<use xlink:href="#ellipse" fill="#0099ff" transform="rotate(72)" />
<use xlink:href="#ellipse" fill="#00cc00" transform="rotate(144)" />
<use xlink:href="#ellipse" fill="#cc9900" transform="rotate(216)" />
<use xlink:href="#ellipse" fill="#ff0000" transform="rotate(288)" />
</g>
</defs>
<use xlink:href="#ellipses" fill-opacity="0.3" />
<use xlink:href="#ellipses" fill-opacity="0" stroke="#000" stroke-width="2" />
<g text-anchor="middle" font-family="sans-serif" font-size="16">
<text x="30" y="-300" dy="0.7ex" font-size="64">A</text>
<text x="300" y="-60" dy="0.7ex" font-size="64">B</text>
<text x="160" y="280" dy="0.7ex" font-size="64">C</text>
<text x="-220" y="220" dy="0.7ex" font-size="64">D</text>
<text x="-280" y="-130" dy="0.7ex" font-size="64">E</text>
<text x="180" y="-130" dy="0.7ex">AB</text>
<text x="40" y="230" dy="0.7ex">AC</text>
<text x="100" y="-200" dy="0.7ex">AD</text>
<text x="-80" y="-215" dy="0.7ex">AE</text>
<text x="190" y="125" dy="0.7ex">BC</text>
<text x="-190" y="120" dy="0.7ex">BD</text>
<text x="230" y="40" dy="0.7ex">BE</text>
<text x="-60" y="220" dy="0.7ex">CD</text>
<text x="-170" y="-150" dy="0.7ex">CE</text>
<text x="-222" y="0" dy="0.7ex">DE</text>
<text x="90" y="150" dy="0.7ex">ABC</text>
<text x="148" y="-153" dy="0.7ex" font-size="14">ABD</text>
<text x="170" y="-20" dy="0.7ex">ABE</text>
<text x="-33" y="208" dy="0.7ex" font-size="14">ACD</text>
<text x="-93" y="-193" dy="0.7ex" font-size="14">ACE</text>
<text x="20" y="-180" dy="0.7ex">ADE</text>
<text x="-120" y="120" dy="0.7ex">BCD</text>
<text x="190" y="100" dy="0.7ex" font-size="14">BCE</text>
<text x="-211" y="32" dy="0.7ex" font-size="14">BDE</text>
<text x="-150" y="-80" dy="0.7ex">CDE</text>
<text x="-30" y="160" dy="0.7ex">ABCD</text>
<text x="140" y="80" dy="0.7ex">ABCE</text>
<text x="120" y="-100" dy="0.7ex">ABDE</text>
<text x="-60" y="-140" dy="0.7ex">ACDE</text>
<text x="-160" y="20" dy="0.7ex">BCDE</text>
<text x="0" y="0" dy="0.7ex">ABCDE</text>
</g>
</svg>
You can see, that the names of the different overlapping fields are annotated as "A", "B", "AB", and so forth. Hence, I can just replace them with sed, to display my own data!
This I do with the following bash/sed command:
while read LINE
do
NAME=$( echo $LINE | awk '{print $1 }')
NUMB=$( echo $LINE | awk '{print $2 }')
echo $NAME $NUMB
sed -i tmp "s/\>${NAME}\</\>${NUMB}\</g" venn.svg
echo "sed -i tmp "s/\>\${NAME}\</\>\${NUMB}\</g" venn.svg"
done < 'numbers.txt'
The altered Venn diagram
The output is indeed exactly what I wanted:
Of course, I still need to adapt the font size, but thanks to sed, this won't be a big issue.
How to convert fasta file to uppercase/lowercase, without altering the annotation line
This awk-oneliner converts all lines which do not contain a '>' in the beginning to upper case:
$ awk '{ if ($0 !~ />/) {print toupper($0)} else {print $0} }' name.fasta
$ awk '{ if ($0 !~ />/) {print toupper($0)} else {print $0} }' name.fasta
How to see stdout output in the terminal and pipe it into a file
Since I always have to look up how to print the stdout output to the screen AND to a file at the same time, I put this small command here:
command 2>&1 | tee outputfile.txt
command 2>&1 | tee outputfile.txt
How to convert .maf alignment format to a RepeatMasker-output-file-like format
The following python script takes a .maf alignment file, which looks for example like this:
# lastz.v1.02.00 --format=maf --output=NM_144963.2-3.maf
#
# hsp_threshold = 3000
# gapped_threshold = 3000
# x_drop = 910
# y_drop = 9400
# gap_open_penalty = 400
# gap_extend_penalty = 30
# A C G T
# A 91 -114 -31 -123
# C -114 100 -125 -31
# G -31 -125 100 -114
# T -123 -31 -114 91
a score=11361
s target 0 133 + 5344 GGCGGGAGGCTTAGGGTGGGTCGCGGTCGCGGCCCCCGCCGCAGGCCCTGGCGGCGGCATCGTGAACATAGACGTGGAGTTCCACATCCGGCACAACTACCCCTGGAACAAGTTGCCGGCCAACGTGAGGCAG
s query 0 133 + 46809 GGCGGGAGGCGTAGTGTGGGTCGCGGTGGCGGCCCCGGCCGCCGGCCCTGGCCGCGGCATCATGAACATAGACGTGGAGTTCCACATCCGGCACAACTACCCCTGGAACAAGTTGCCGGCCAACGTGAGACAG
a score=7552
s target 207 151 + 5344 tacaggaagtattggtcttgctgttcagagtggcagaggagaaaga--------aaatcccagcC---GTCTTGGAAATTCACAGAGAGAATATGAAAATCAGGTTGTCCTGTACAGTATCTGCAATCAGTTACGATACAGAAATAACTTAAACATGTCAAG
s query 5371 162 + 46809 tatgggaaaaaatagtactattgttcagtttggaagttgtttaaaatctttttgaaatttctcctagAGTCTTGGAAATTCACAGAGAGAATATGAAAAGCAGGTTGTCCTGTACAGTATCCGCAATCAGTTACGATATAGAAATAACTTAGgtaagtagag
a score=13285
s target 344 152 + 5344 TTAAACATGTCAAGAAAGATGAACGCAGATACTATGAGGAACTGCTAAACTACAGCCGAGATCATCTCATGCTGTACCCTTACCGTCTATTGGATATTATGGTGAAGAGCTTGAGCATAACACCATTTTCATATTACACTGGGATTATGGAG
s query 6505 152 + 46809 TTAAACATGTCAAGAAAGATGAACGCAGATACTATGAGGAACTGCTAAAGTACAGCCGAGATCATCTCATGCTGTACCCTTACCATCTATCGGATATTATGGTGAAAGGCTTGAGGATAACACCATTTTCATATTATACTGGGATTATGGAG
a score=6190
s target 536 68 + 5344 GTCTAAGGCTTCTTGGCATAGGAAGAAACCAGTATATTGATCTTATGATTCAGTGTAGATCATCAAAA
s query 9383 68 + 46809 GTCTAAGGCTTCTTGGCATAGGAAGAAACCAGTATATTGATCTTATGAATCAGTGTAGATCATCAAAA
and converts it into a RepeatMasker-output-like format that looks like this:
SW subst del_in_retro insert_in_retro query qstart qend qleft compl match match_class mstart mend mleft id
11361 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 0 133 5211 +,+ hg19_refGene_NM_144963 nA 0 133 46676 1
7552 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 207 358 4986 +,+ hg19_refGene_NM_144963 nA 5371 5533 41276 2
13285 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 344 496 4848 +,+ hg19_refGene_NM_144963 nA 6505 6657 40152 3
6190 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 536 604 4740 +,+ hg19_refGene_NM_144963 nA 9383 9451 37358 4
I save also the output that will not go into the new file into variables, in case I maybe want to include it as well.
import sys
INFILENAME=sys.argv[1]
WITH_HEADER=False
try:
WITH_HEADER=sys.argv[2]
except:
pass
INFILE = open(INFILENAME, 'r')
ALNS=[]
i=0
for LINE in INFILE.readlines():
if 'a score=' in LINE:
SCORE=LINE.split('=')[1].strip()
i+=1
QUERY_PASSED=False
MATCH_PASSED=False
elif LINE.startswith('s') and not QUERY_PASSED:
MUERY, MSTART, MSIZE, MSTRAND, MSRCSIZE, MSEQUENCE = LINE.split()[1:]
MEND = str(int(MSTART) + int(MSIZE))
MLEFT = str(int(MSRCSIZE) - int(MEND))
MUERY_PASSED = True
elif LINE.startswith('s') and QUERY_PASSED:
QATCH, QSTART, QSIZE, QSTRAND, QSRCSIZE, QSEQUENCE = LINE.split()[1:]
QEND = str(int(QSTART) + int(QSIZE))
QLEFT = str(int(QSRCSIZE) - int(QEND))
QATCH_PASSED=True
ALNS.append((SCORE, 'nA', 'nA', 'nA', QUERY, QSTART, QEND, QLEFT, MSTRAND, MATCH, 'nA', MSTART, MEND, MLEFT, str(i)))
INFILE.close()
if WITH_HEADER:
print("SW subst del_in_retro insert_in_retro query qstart qend qleft compl match match_class mstart mend mleft id")
for LINE in ALNS:
print('\t'.join(LINE))
# lastz.v1.02.00 --format=maf --output=NM_144963.2-3.maf
#
# hsp_threshold = 3000
# gapped_threshold = 3000
# x_drop = 910
# y_drop = 9400
# gap_open_penalty = 400
# gap_extend_penalty = 30
# A C G T
# A 91 -114 -31 -123
# C -114 100 -125 -31
# G -31 -125 100 -114
# T -123 -31 -114 91
a score=11361
s target 0 133 + 5344 GGCGGGAGGCTTAGGGTGGGTCGCGGTCGCGGCCCCCGCCGCAGGCCCTGGCGGCGGCATCGTGAACATAGACGTGGAGTTCCACATCCGGCACAACTACCCCTGGAACAAGTTGCCGGCCAACGTGAGGCAG
s query 0 133 + 46809 GGCGGGAGGCGTAGTGTGGGTCGCGGTGGCGGCCCCGGCCGCCGGCCCTGGCCGCGGCATCATGAACATAGACGTGGAGTTCCACATCCGGCACAACTACCCCTGGAACAAGTTGCCGGCCAACGTGAGACAG
a score=7552
s target 207 151 + 5344 tacaggaagtattggtcttgctgttcagagtggcagaggagaaaga--------aaatcccagcC---GTCTTGGAAATTCACAGAGAGAATATGAAAATCAGGTTGTCCTGTACAGTATCTGCAATCAGTTACGATACAGAAATAACTTAAACATGTCAAG
s query 5371 162 + 46809 tatgggaaaaaatagtactattgttcagtttggaagttgtttaaaatctttttgaaatttctcctagAGTCTTGGAAATTCACAGAGAGAATATGAAAAGCAGGTTGTCCTGTACAGTATCCGCAATCAGTTACGATATAGAAATAACTTAGgtaagtagag
a score=13285
s target 344 152 + 5344 TTAAACATGTCAAGAAAGATGAACGCAGATACTATGAGGAACTGCTAAACTACAGCCGAGATCATCTCATGCTGTACCCTTACCGTCTATTGGATATTATGGTGAAGAGCTTGAGCATAACACCATTTTCATATTACACTGGGATTATGGAG
s query 6505 152 + 46809 TTAAACATGTCAAGAAAGATGAACGCAGATACTATGAGGAACTGCTAAAGTACAGCCGAGATCATCTCATGCTGTACCCTTACCATCTATCGGATATTATGGTGAAAGGCTTGAGGATAACACCATTTTCATATTATACTGGGATTATGGAG
a score=6190
s target 536 68 + 5344 GTCTAAGGCTTCTTGGCATAGGAAGAAACCAGTATATTGATCTTATGATTCAGTGTAGATCATCAAAA
s query 9383 68 + 46809 GTCTAAGGCTTCTTGGCATAGGAAGAAACCAGTATATTGATCTTATGAATCAGTGTAGATCATCAAAA
and converts it into a RepeatMasker-output-like format that looks like this:
SW subst del_in_retro insert_in_retro query qstart qend qleft compl match match_class mstart mend mleft id
11361 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 0 133 5211 +,+ hg19_refGene_NM_144963 nA 0 133 46676 1
7552 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 207 358 4986 +,+ hg19_refGene_NM_144963 nA 5371 5533 41276 2
13285 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 344 496 4848 +,+ hg19_refGene_NM_144963 nA 6505 6657 40152 3
6190 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 536 604 4740 +,+ hg19_refGene_NM_144963 nA 9383 9451 37358 4
import sys
INFILENAME=sys.argv[1]
WITH_HEADER=False
try:
WITH_HEADER=sys.argv[2]
except:
pass
INFILE = open(INFILENAME, 'r')
ALNS=[]
i=0
for LINE in INFILE.readlines():
if 'a score=' in LINE:
SCORE=LINE.split('=')[1].strip()
i+=1
QUERY_PASSED=False
MATCH_PASSED=False
elif LINE.startswith('s') and not QUERY_PASSED:
MUERY, MSTART, MSIZE, MSTRAND, MSRCSIZE, MSEQUENCE = LINE.split()[1:]
MEND = str(int(MSTART) + int(MSIZE))
MLEFT = str(int(MSRCSIZE) - int(MEND))
MUERY_PASSED = True
elif LINE.startswith('s') and QUERY_PASSED:
QATCH, QSTART, QSIZE, QSTRAND, QSRCSIZE, QSEQUENCE = LINE.split()[1:]
QEND = str(int(QSTART) + int(QSIZE))
QLEFT = str(int(QSRCSIZE) - int(QEND))
QATCH_PASSED=True
ALNS.append((SCORE, 'nA', 'nA', 'nA', QUERY, QSTART, QEND, QLEFT, MSTRAND, MATCH, 'nA', MSTART, MEND, MLEFT, str(i)))
INFILE.close()
print("SW subst del_in_retro insert_in_retro query qstart qend qleft compl match match_class mstart mend mleft id")
for LINE in ALNS:
print('\t'.join(LINE))
Labels:
alignment,
convert,
converter,
format,
maf,
Multiple alignment,
python,
RepeatMasker
Subscribe to:
Posts (Atom)