2014-06-11

Overlay many PDFs with each other, for example to add individual page numbers

Some times you have a large number of PDFs you wan to include in a document but that do not have the correct page numbers on them.

To overcome this problem, you can use the command line tool pdftk.

Let's assume you have a 20-page PDF document called "appendix.pdf" to which you want to add page numbers that are in the same layout as in the other pages of your document "thesis.doc", or "thesis.pdf". First, you need to create another 20-page PDF document that contains only the page numbers. You can do this in Word, or LaTeX, or whatever you are using for your original document and save the file as "page_numbers.pdf".
The only thing that is important, is that both documents have the same number of pages.

Now you have the two documents:

  • page_numbers.pdf (20 pages)
  • appendix.pdf (20 pages)


1. Burst pdfs

Use the following command to create two empty folders and the pdftk command to "burst" the PDFs into twenty single page pdf files.

mkdir page_nrs
mkdir appendix
mv page_numbers.pdf page_nrs/
mv appendix.pdf appendix/
cd page_nrs; pdftk page_numbers.pdf burst; cd ..
cd appendix; pdftk appendix.pdf burst; cd ..

You will find, that both folders contain twenty pdf files each with names like this: pg_0001.pdf.

2. overlay pdfs

If you now want to combine/overlay those files, use the following command:

mkdir combined
cd page_nrs
for FILE in pg*pdf
 do
 pdftk $FILE background ../appendix/$FILE output ../combined/$FILE
done
cd ..


3. concatenate overlayed pdfs

Now, each file is overlayed with the corresponding page number. To combine the twenty individual files into one pdf document, use:

cd combined; pdftk p*pdf cat output finished.pdf

Now you are done and can have a look, if everything is correct!



4. Only if pages do not fit: shrink appendix.pdf

If you have the problem, that some pages in your appendix have text at the same position as the page number, you can shrink the pdf content and increase the "border" around the pdf while keeping the same page size with this command:

cd appendix
for FILE in pg*pdf
 do
 gs \
 -dPDFSETTINGS=/default \
 -sDEVICE=pdfwrite \
 -o ../Text_resized/$FILE \
 -dDEVICEWIDTHPOINTS=714 \
 -dDEVICEHEIGHTPOINTS=1010 \
 -dFIXEDMEDIA \
 -c "<</PageOffset [70 50]>> setpagedevice" \
 -f $FILE

done
cd ...

Now you can continue with step number 2, and see if it looks better.
If the pages are still too large, or if they are too small, play around with the numbers marked in orange, to get a different result.

2014-04-23

Compilation of useful commands when dealing with large text files, such as fasta files

# Get line nr. XY from a file in a fast way via sed
sed -n '99999p' myfile.fasta

# Get number of sequences in a fasta file
grep -c '>' myfile.fasta


2014-04-10

Convert UCSC Aln

UCSC 100-Way Genome Alignment, downloaded from their website,
is in strange format, hence:
1. Delete all lines not belonging to alignment
2. Execute RegEx, to include a tab between species name and sequence, but not in between different words in a species name
perl -pe  "s/(([a-zA-Z-\(\)'\.]+\s){1,})([Nacgt=-]+)/\1\t\3/g"
3. Upload file to Galaxy
4. Tabular-To-Fasta Converter
5. Concatenate Fasta by Species

Done!

2014-04-06

Script to compute bp of a band in a gel electrophoresis picture

mm - milimeter from where sample was put into gele
bp - base pair


You need:
  1. Text file with marker sizes and distances (marker.csv)
  2. Text file with distance of your sample (sample.csv)

# R Script file get_bp_from_mm.R:

getbpfrommm <- function(marker, unknown, unknown.filename){
  # log bp of marker
  markerlog <- marker
  markerlog$bp <- log(markerlog$bp)
 
  # make linear model
  marker.lm <- lm(markerlog$bp ~ markerlog$mm)
 
  b=coef(marker.lm)[1]
  m=coef(marker.lm)[2]
 
  unknown.bp <- c()
 
  for(unknown.mm.i in unknown){
    unknown.bp.i = round(exp(m*unknown.mm.i+b))
    unknown.bp <- c(unknown.bp, unknown.bp.i)}
 
  unknown.bp.mm <- cbind(unknown, unknown.bp)
  colnames(unknown.bp.mm) <- c("mm", "bp")
 
  print(unknown.bp.mm)
 
  write.table(unknown.bp.mm, file=paste(unknown.filename, "with_bp.csv", sep="_"), quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE)
}

#read command line arguments
args <- commandArgs(trailingOnly = TRUE)
marker <- read.delim(args[1])
unknown <- read.delim(args[2])

getbpfrommm(marker, unknown, args[2])





# Invoke from command line
$ cat marker.csv
mm    bp
132.60    75
116.80    200
106.20    300
95.60    400
89.20    500
72.20    700
62.80    1000
48.30    1500
38.40    2000
19.1    5000
 

$ cat sample.csv
mm
103.4
98.8 


$ Rscript get_bp_from_mm.R marker.csv sample.csv
     mm  bp
1 103.4 272
2  98.8 317


$ cat sample.csv_with_bp.csv
mm    bp
103.4    272
98.8    317

2014-04-02

sed to edit annotation line of NCBI RefSeq fasta file

cat test.fa
 >gi|254939587|ref|NG_012567.1| Homo sapiens NADPH oxidase 1 (NOX1), RefSeqGene on chromosome X
ATTCTGTGATCACCAGCTTATCAAAAGACTTCCTAGTACTCTGATATTGGGAATGGGGGTCCTACCTCACAGACATAAGG
GTCCAATCAGCATGGCATATATAATTCTTTAGATAATACATAAATTGTCATCCAGATTATAGATCATTCTTTTATGAATC
ACAGGATCTCAATGTTGGAGTATATTTAAGGGACATTTAGTTAACCATCTACCTGGTGCTGATATTCCCCTTATAAAAAG
CTGACAAGGGGTTGTCCATTTTTCCTTGAGAGTCTCAAGTAATAGGAAACTCATTACCTCTTTACTTCTCTGAATACCCG
TGTTGGAAAATTCTGCTTTATATTGAAAAAAAATTGTGTTACTTTATTTATTTTTATTTTTATTTTTTGACACAGAATCT
TATTCTTTCGCCCAGTCTGGAGTGCAGTGGCGTGACCTTGGCTCACCACAACCCTCGCCTCCTGGATTCAAGCGATTCAA
GTGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCATGCACCACCATGCCAGGCTAATTTTTGTATTTTTGGTAGAGATG
GGGGTTTGCCATGTTGGCCAGCCAGGCTTGTCTCAAACTCCTGACCCCAGGTGATCTGCCCGCCTCAGCCTCCCAAAGTG
CTGGGATTATAGGCGTGAGCCACCATGCCCAACCTAAAGTGTGTTACTTTAGAGCCTCTATTCCTTGGTTGACTCCTTGG


sed -e'/>/s/^[^|]*\|\(.*\)\|\(.*\)\|.*$/>\2/g'  test.fa |  sed 's/\.[0-9]*//g' > test2.fa
cat test2.fa
>NG_012567
ATTCTGTGATCACCAGCTTATCAAAAGACTTCCTAGTACTCTGATATTGGGAATGGGGGTCCTACCTCACAGACATAAGG
GTCCAATCAGCATGGCATATATAATTCTTTAGATAATACATAAATTGTCATCCAGATTATAGATCATTCTTTTATGAATC
ACAGGATCTCAATGTTGGAGTATATTTAAGGGACATTTAGTTAACCATCTACCTGGTGCTGATATTCCCCTTATAAAAAG
CTGACAAGGGGTTGTCCATTTTTCCTTGAGAGTCTCAAGTAATAGGAAACTCATTACCTCTTTACTTCTCTGAATACCCG
TGTTGGAAAATTCTGCTTTATATTGAAAAAAAATTGTGTTACTTTATTTATTTTTATTTTTATTTTTTGACACAGAATCT
TATTCTTTCGCCCAGTCTGGAGTGCAGTGGCGTGACCTTGGCTCACCACAACCCTCGCCTCCTGGATTCAAGCGATTCAA
GTGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCATGCACCACCATGCCAGGCTAATTTTTGTATTTTTGGTAGAGATG
GGGGTTTGCCATGTTGGCCAGCCAGGCTTGTCTCAAACTCCTGACCCCAGGTGATCTGCCCGCCTCAGCCTCCCAAAGTG
CTGGGATTATAGGCGTGAGCCACCATGCCCAACCTAAAGTGTGTTACTTTAGAGCCTCTATTCCTTGGTTGACTCCTTGG

2014-02-18

LaTeX Cheat sheet

I am currently writing a long document in LaTeX and here is a small cheat sheet of functions that I use quite often:


To rescale tables/...

\resizebox{width}{height}{object} 
Needs package graphicx


\resizebox{\textwidth}{!}{...}


To create long table that has caption that splits over multiple pages

\begin{longtable}{P{.05\textwidth}P{.18\textwidth}P{.25\textwidth}P{.5\textwidth}}
 %%
   % Kopf erste Seite
        \caption[Short caption]{Looong caption.}\label{tab:mylabel} \\
        \toprule
                Nr. &  Nr2      &Example                 &  Description\\
        \midrule
        \endfirsthead
% Kopf weitere Seiten
        \caption[]{\emph{(Continued)}} \\

         \toprule
               Nr. &  Title      &Example                 &  Description\\
        \midrule
        \endhead

1&2&3&4\\
 
\bottomrule
\end{longtable}               


 

To create table with booktabs

\begin{table}
\caption[]{}    
\begin{center}
 % \resizebox{\textwidth}{!}{
    \begin{tabular}{ll}
    \toprule
      & \\

    \midrule
      & \\
    \bottomrule
    \end{tabular}
%}
 \end{center}
\end{table} 

2014-01-09

How to convert a BED file with one line per gene/item into a BED file with one line per block

If a BED file  looks for example like this

chr1    11873    14361    AM992877    0    +    11873    14361    0    3    354,109,1141,    0,739,1347,
chr1    11873    14361    AM992881    0    +    11873    14361    0    3    354,127,959,    0,721,1529,
chr1    11873    14362    AM992878    0    +    11873    14362    0    5    354,52,436,299,404,    0,772,1347,1785,2085,
chr1    11873    14362    AM992879    0    +    11873    14362    0    3    354,109,1142,    0,739,1347,
chr1    11873    14409    AM992871    0    +    11873    14409    0    3    354,109,1189,    0,739,1347,


it contains one line for every mRNA (in this case), giving the mRNA start and end coordinates in the first three columns. However, the blocks' coordinates are given in the last two columns and if you use a tool such as, e.g., Bedtools, you can only find results for the whole length of the mRNA.
I wanted to get results based on the individual blocks of the mRNA.

To get these I wrote a script that turns the above BED file into this format:
chr1    11873    12227    AM992877    0    +    11873    14361    0    3    354,109,1141,    0,739,1347,
chr1    12612    12721    AM992877    0    +    11873    14361    0    3    354,109,1141,    0,739,1347,
chr1    13220    14361    AM992877    0    +    11873    14361    0    3    354,109,1141,    0,739,1347,
chr1    11873    12227    AM992881    0    +    11873    14361    0    3    354,127,959,    0,721,1529,
chr1    12594    12721    AM992881    0    +    11873    14361    0    3    354,127,959,    0,721,1529,
chr1    13402    14361    AM992881    0    +    11873    14361    0    3    354,127,959,    0,721,1529,
chr1    11873    12227    AM992878    0    +    11873    14362    0    5    354,52,436,299,404,    0,772,1347,1785,2085,
chr1    12645    12697    AM992878    0    +    11873    14362    0    5    354,52,436,299,404,    0,772,1347,1785,2085,
chr1    13220    13656    AM992878    0    +    11873    14362    0    5    354,52,436,299,404,    0,772,1347,1785,2085,
chr1    13658    13957    AM992878    0    +    11873    14362    0    5    354,52,436,299,404,    0,772,1347,1785,2085,
chr1    13958    14362    AM992878    0    +    11873    14362    0    5    354,52,436,299,404,    0,772,1347,1785,2085,
chr1    11873    12227    AM992879    0    +    11873    14362    0    3    354,109,1142,    0,739,1347,
chr1    12612    12721    AM992879    0    +    11873    14362    0    3    354,109,1142,    0,739,1347,
chr1    13220    14362    AM992879    0    +    11873    14362    0    3    354,109,1142,    0,739,1347,
chr1    11873    12227    AM992871    0    +    11873    14409    0    3    354,109,1189,    0,739,1347,
chr1    12612    12721    AM992871    0    +    11873    14409    0    3    354,109,1189,    0,739,1347,
chr1    13220    14409    AM992871    0    +    11873    14409    0    3    354,109,1189,    0,739,1347,


This is the script, which works nice, if the BED file has the exact same format as in the example:


import sys

infile = sys.argv[1]

infile_handle = open(infile)

for line in infile_handle.readlines():
    chr, chr_start, chr_end, name, score, strand, thickStart, \
    thickEnd, itemRGB, blockCount, blockSizes, blockStarts = \
    line.split('\t')
    chr_start = int(chr_start)
    chr_end   = int(chr_end)
    for i, blockSize in enumerate(blockSizes.split(',')[:-1]):
        blockSize = int(blockSize)
        blockStart = int(blockStarts.split(',')[:-1][i])
        s = "\t".join([chr, str(chr_start + blockStart), str(chr_start + blockStart + blockSize), name, score, strand, thickStart, thickEnd, itemRGB, blockCount, blockSizes, blockStarts])
        print s,

2013-11-12

How to get chromosome coordinates from a string using RegEx in Python





def get_chr_coordinates_from_string(s):
	'''
	Returns chromosome coordinates from a given string.
	Coordinates have to be in "chr0:0000-0000" style!
	'''
	import re
	regex_for_chr       = re.compile('chr[0-9a-zA-Z_]*:')
	regex_for_start_end = re.compile('chr[0-9a-zA-Z_]*:[0-9]+-[0-9]+')
	chr  	   = re.search(regex_for_chr, s).group().replace(':', '')
	start, end = re.search(regex_for_start_end, s).group().split(':')[1].split('-')
	return(chr, start, end)

Example:

>>> s=">hg19_refGene_NM_012147 range=chr4:191008763-191010142 5'pad=0 3'pad=0 strand=+ repeatMasking=none"
>>> get_chr_coordinates_from_string(s)
('chr4', '191008763', '191010142')


Code highlighted using http://pygments.org/

2013-09-10

How to install LaTeX packages

Since I tend to forget how easy it is to install LaTeX packages, here is a very short how-to:
  • Go to http://www.ctan.org/tex-archive/install/macros/latex/contrib and find the package you want
  • download the .tds.zip file
  • change to the folder were your downloaded file was saved to
  • unzip it to the directory were your texmf is stored, for me (Ubuntu) i.e.: /home/username/texmf/
  • update tex, in the terminal, type: texhash ~/texmf/
  • Done!
Or, in the terminal with the glossaries package as an example:

cd ~/Downloads/
wget http://www.ctan.org/tex-archive/install/macros/latex/contrib/glossaries.tds.zip

unzip ~/Downloads/glossaries.tds.zip -d ~/texmf/

texhash ~/texmf/

Note: This way, the new package is only installed for you, not for all users!

2013-09-05

Create a Multiple Sequence Alignment with masked parts and replace masked parts by propper sequence

The reason why I wrote the following Script is that I wanted to align a couple of sequences which contained repeats. Not surprisingly the alignment muscle produced did not look very nice, due to those repeats.

Unfortunately, one cannot tell muscle to ignore all small caps letters (afaik), hence I replaced all small caps by 'N' s. Making again an alignment produced a much better result. However, my sequences still contain the 'N's, of course. This is where the small Script kicks in: It replaces all N's with the character that is supposed to be found at this position.

Here is the code of the Script:

#! /usr/bin/python


"""

Script to "unmask" an aligned sequence.

Takes an alignment file of masked sequences and a file with the unmasked sequences

and returns an alignment file with unmasked sequences.

Written by: Tabea Hoehmann

"""

import sys

from textwrap import wrap

def read_fasta_file(fasta_filename):

"""
Read a fasta file and save the sequences to a dictionary.
Returns a dictionary with the sequence id as key and the
sequence as value.
"""
seq={}
fasta_filehandle = open(fasta_filename, 'r')
i=0
for line in fasta_filehandle.readlines():
if '>' in line:
name = line.strip().replace('>','').upper() ### !!!! Only the first element !!!!
seq[name] = ""
else:
seq[name] += line.strip()
if not set(line.strip()).issubset(set('actgkrynACTGKRYN-')):
print("Unknown character(s) in line %s of sequence file %s: %s\nScript proceeds as if nothing happened..."%(i, fasta_filename, ", ".join(set(line)-set('actgkrynACTGKRYN-'))))
i+=1
fasta_filehandle.close()
return(seq)


# Read fasta files

try:
masked_aln_fasta = read_fasta_file(sys.argv[1])
unmasked_fasta   = read_fasta_file(sys.argv[2])
except IndexError:
print("Usage: unmask_fasta_aln.py file1 file 2")
print("file1: The alignment file that is masked by N's")
print("file1: The fasta file that is not masked")
sys.exit()

# New dictionary to store edited sequences

new_unmasked_aln_fasta = {}

# Iterate over every sequence, replace all N's by nucleotides

for masked_id, masked_seq in masked_aln_fasta.items():
unmasked_seq = unmasked_fasta[masked_id]
new_unmasked_aln_seq = ""
nt_pos = 0
for nt in masked_seq:
if nt not in 'nN':
new_unmasked_aln_seq += nt
else:
new_unmasked_aln_seq += unmasked_seq[nt_pos]
if nt != '-':
nt_pos+=1
new_unmasked_aln_fasta[masked_id] = new_unmasked_aln_seq


# Print result to stdout

for id, seq in new_unmasked_aln_fasta.items():
print('>%s\n%s'%(id, '\n'.join(wrap(seq, 100, break_on_hyphens=False))))

2013-07-29

Use cut to extract a certain text range from a one-line DNA sequence

I wanted to get the characters at position 27947 to 28203 from a text file containing just one DNA sequence as a very long line. This is the cut command that worked:

cut -c 27947-28203 file.txt

2013-07-19

How to automatically create Venn diagram for five sets

Wikipedia has a very nice Venn diagram for five sets. I use this svg file to automatically create a Venn diagram displaying the number of intersecting elements of five sets.

Number of intersecting elements

For this we need the number of intersecting elements for all possible set combinations, in my example it looks like this:

numbers.txt
A 13644
B 14729
C 14690
D 13725
E 13742
AB 3689
AC 3616
AD 3523
AE 3496
BC 14281
BD 12852
BE 12694
CD 13215
CE 13060
DE 13563
ABC 3609
ABD 3507
ABE 3480
ACD 3513
ACE 3487
ADE 3496
BCD 12849
BCE 12694
BDE 12694
CDE 13056
ABCD 3506
ABCE 3480
ABDE 3480
ACDE 3487
BCDE 12694
ABCDE 3480

To get these numbers I used a small R script and formatted the output with a little bit of TextWrangler and grep to look like in the file numbers.txt

Here is the short R script:

A <- read.delim('Info1.csv')$name
B <- read.delim('Info2.csv')$name
C <- read.delim('Info3.csv')$name
D <- read.delim('Info4.csv')$name
E <- read.delim('Info5.csv')$name


# all sets
print(paste("A",length(A)))
print(paste("B",length(B)))
print(paste("C",length(C)))
print(paste("D",length(D)))
print(paste("E",length(E)))

# all combinations of two sets
print(paste("A, B", length(intersect(A, B))))
print(paste("A, C", length(intersect(A, C))))
print(paste("A, D", length(intersect(A, D))))
print(paste("A, E", length(intersect(A, E))))
print(paste("B, C", length(intersect(B, C))))
print(paste("B, D", length(intersect(B, D))))
print(paste("B, E", length(intersect(B, E))))
print(paste("C, D", length(intersect(C, D))))
print(paste("C, E", length(intersect(C, E))))
print(paste("D, E", length(intersect(D, E))))
# all combinations of three sets
print(paste("A, B, C", length(intersect(A, intersect(B, C)))))
print(paste("A, B, D", length(intersect(A, intersect(B, D)))))
print(paste("A, B, E", length(intersect(A, intersect(B, E)))))
print(paste("A, C, D", length(intersect(A, intersect(C, D)))))
print(paste("A, C, E", length(intersect(A, intersect(C, E)))))
print(paste("A, D, E", length(intersect(A, intersect(D, E)))))
print(paste("B, C, D", length(intersect(B, intersect(C, D)))))
print(paste("B, C, E", length(intersect(B, intersect(C, E)))))
print(paste("B, D, E", length(intersect(B, intersect(D, E)))))
print(paste("C, D, E", length(intersect(C, intersect(D, E)))))
# all combinations of four sets
print(paste("A, B, C, D", length(intersect(A, intersect(B, intersect(C, D))))))
print(paste("A, B, C, E", length(intersect(A, intersect(B, intersect(C, E))))))
print(paste("A, B, D, E", length(intersect(A, intersect(B, intersect(D, E))))))
print(paste("A, C, D, E", length(intersect(A, intersect(C, intersect(D, E))))))
print(paste("B, C, D, E", length(intersect(B, intersect(C, intersect(D, E))))))
# combination of five sets
print(paste("ABCDE", length(intersect(A, intersect(B, intersect(C, intersect(D, E)))))))

Getting these numbers into the diagram

Since the svg file only consists of 51 lines, I display it here:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="746" height="742" viewBox="-362 -388 746 742">
 <title>Radially-symmetrical Five-set Venn Diagram</title>
 <desc>Devised by Branko Gruenbaum and rendered by CMG Lee.</desc>
 <defs>
  <ellipse id="ellipse" cx="36" cy="-56" rx="160" ry="320" />
  <g id="ellipses">
   <use xlink:href="#ellipse" fill="#0000ff" />
   <use xlink:href="#ellipse" fill="#0099ff" transform="rotate(72)" />
   <use xlink:href="#ellipse" fill="#00cc00" transform="rotate(144)" />
   <use xlink:href="#ellipse" fill="#cc9900" transform="rotate(216)" />
   <use xlink:href="#ellipse" fill="#ff0000" transform="rotate(288)" />
  </g>
 </defs>
 <use xlink:href="#ellipses" fill-opacity="0.3" />
 <use xlink:href="#ellipses" fill-opacity="0" stroke="#000" stroke-width="2" />
 <g text-anchor="middle" font-family="sans-serif" font-size="16">
  <text x="30"   y="-300" dy="0.7ex" font-size="64">A</text>
  <text x="300"  y="-60"  dy="0.7ex" font-size="64">B</text>
  <text x="160"  y="280"  dy="0.7ex" font-size="64">C</text>
  <text x="-220" y="220"  dy="0.7ex" font-size="64">D</text>
  <text x="-280" y="-130" dy="0.7ex" font-size="64">E</text>
  <text x="180"  y="-130" dy="0.7ex">AB</text>
  <text x="40"   y="230"  dy="0.7ex">AC</text>
  <text x="100"  y="-200" dy="0.7ex">AD</text>
  <text x="-80"  y="-215" dy="0.7ex">AE</text>
  <text x="190"  y="125"  dy="0.7ex">BC</text>
  <text x="-190" y="120"  dy="0.7ex">BD</text>
  <text x="230"  y="40"   dy="0.7ex">BE</text>
  <text x="-60"  y="220"  dy="0.7ex">CD</text>
  <text x="-170" y="-150" dy="0.7ex">CE</text>
  <text x="-222" y="0"    dy="0.7ex">DE</text>
  <text x="90"   y="150"  dy="0.7ex">ABC</text>
  <text x="148"  y="-153" dy="0.7ex" font-size="14">ABD</text>
  <text x="170"  y="-20"  dy="0.7ex">ABE</text>
  <text x="-33"  y="208"  dy="0.7ex" font-size="14">ACD</text>
  <text x="-93"  y="-193" dy="0.7ex" font-size="14">ACE</text>
  <text x="20"   y="-180" dy="0.7ex">ADE</text>
  <text x="-120" y="120"  dy="0.7ex">BCD</text>
  <text x="190"  y="100"  dy="0.7ex" font-size="14">BCE</text>
  <text x="-211" y="32"   dy="0.7ex" font-size="14">BDE</text>
  <text x="-150" y="-80"  dy="0.7ex">CDE</text>
  <text x="-30"  y="160"  dy="0.7ex">ABCD</text>
  <text x="140"  y="80"   dy="0.7ex">ABCE</text>
  <text x="120"  y="-100" dy="0.7ex">ABDE</text>
  <text x="-60"  y="-140" dy="0.7ex">ACDE</text>
  <text x="-160" y="20"   dy="0.7ex">BCDE</text>
  <text x="0"    y="0"    dy="0.7ex">ABCDE</text>
 </g>
</svg>

You can see, that the names of the different overlapping fields are annotated as "A", "B", "AB", and so forth. Hence, I can just replace them with sed, to display my own data!

This I do with the following bash/sed command:

while read LINE
do
 NAME=$( echo $LINE | awk '{print $1 }')
 NUMB=$( echo $LINE | awk '{print $2 }')
 echo $NAME $NUMB
 sed -i tmp "s/\>${NAME}\</\>${NUMB}\</g" venn.svg
 echo "sed -i tmp "s/\>\${NAME}\</\>\${NUMB}\</g" venn.svg"
done < 'numbers.txt'

The altered Venn diagram

The output is indeed exactly what I wanted:

Of course, I still need to adapt the font size, but thanks to sed, this won't be a big issue.


How to convert fasta file to uppercase/lowercase, without altering the annotation line

This awk-oneliner converts all lines which do not contain a '>' in the beginning to upper case:

$ awk '{ if ($0 !~ />/) {print toupper($0)} else {print $0} }' name.fasta

How to see stdout output in the terminal and pipe it into a file

Since I always have to look up how to print the stdout output to the screen AND to a file at the same time, I put this small command here:

command 2>&1 | tee outputfile.txt

How to convert .maf alignment format to a RepeatMasker-output-file-like format

The following python script takes a .maf alignment file, which looks for example like this:


# lastz.v1.02.00 --format=maf --output=NM_144963.2-3.maf 
#
# hsp_threshold      = 3000
# gapped_threshold   = 3000
# x_drop             = 910
# y_drop             = 9400
# gap_open_penalty   = 400
# gap_extend_penalty = 30
#        A    C    G    T
#   A   91 -114  -31 -123
#   C -114  100 -125  -31
#   G  -31 -125  100 -114
#   T -123  -31 -114   91
a score=11361
s target 0 133 +  5344 GGCGGGAGGCTTAGGGTGGGTCGCGGTCGCGGCCCCCGCCGCAGGCCCTGGCGGCGGCATCGTGAACATAGACGTGGAGTTCCACATCCGGCACAACTACCCCTGGAACAAGTTGCCGGCCAACGTGAGGCAG
s query           0 133 + 46809 GGCGGGAGGCGTAGTGTGGGTCGCGGTGGCGGCCCCGGCCGCCGGCCCTGGCCGCGGCATCATGAACATAGACGTGGAGTTCCACATCCGGCACAACTACCCCTGGAACAAGTTGCCGGCCAACGTGAGACAG

a score=7552
s target  207 151 +  5344 tacaggaagtattggtcttgctgttcagagtggcagaggagaaaga--------aaatcccagcC---GTCTTGGAAATTCACAGAGAGAATATGAAAATCAGGTTGTCCTGTACAGTATCTGCAATCAGTTACGATACAGAAATAACTTAAACATGTCAAG
s query           5371 162 + 46809 tatgggaaaaaatagtactattgttcagtttggaagttgtttaaaatctttttgaaatttctcctagAGTCTTGGAAATTCACAGAGAGAATATGAAAAGCAGGTTGTCCTGTACAGTATCCGCAATCAGTTACGATATAGAAATAACTTAGgtaagtagag

a score=13285
s target  344 152 +  5344 TTAAACATGTCAAGAAAGATGAACGCAGATACTATGAGGAACTGCTAAACTACAGCCGAGATCATCTCATGCTGTACCCTTACCGTCTATTGGATATTATGGTGAAGAGCTTGAGCATAACACCATTTTCATATTACACTGGGATTATGGAG
s query           6505 152 + 46809 TTAAACATGTCAAGAAAGATGAACGCAGATACTATGAGGAACTGCTAAAGTACAGCCGAGATCATCTCATGCTGTACCCTTACCATCTATCGGATATTATGGTGAAAGGCTTGAGGATAACACCATTTTCATATTATACTGGGATTATGGAG

a score=6190
s target  536 68 +  5344 GTCTAAGGCTTCTTGGCATAGGAAGAAACCAGTATATTGATCTTATGATTCAGTGTAGATCATCAAAA
s query           9383 68 + 46809 GTCTAAGGCTTCTTGGCATAGGAAGAAACCAGTATATTGATCTTATGAATCAGTGTAGATCATCAAAA


and converts it into a RepeatMasker-output-like format that looks like this:
SW      subst   del_in_retro    insert_in_retro query   qstart  qend    qleft   compl   match   match_class     mstart  mend    mleft   id
11361 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 0 133 5211 +,+ hg19_refGene_NM_144963 nA 0 133 46676 1
7552 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 207 358 4986 +,+ hg19_refGene_NM_144963 nA 5371 5533 41276 2
13285 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 344 496 4848 +,+ hg19_refGene_NM_144963 nA 6505 6657 40152 3
6190 nA nA nA hg19_ucscRetroAli5_NM_144963.2-3 536 604 4740 +,+ hg19_refGene_NM_144963 nA 9383 9451 37358 4

I save also the output that will not go into the new file into variables, in case I maybe want to include it as well.




import sys
INFILENAME=sys.argv[1]

WITH_HEADER=False
try:
WITH_HEADER=sys.argv[2]
except:
pass

INFILE = open(INFILENAME, 'r')

ALNS=[]
i=0
for LINE in INFILE.readlines():
if 'a score=' in LINE:
SCORE=LINE.split('=')[1].strip()
i+=1
QUERY_PASSED=False
MATCH_PASSED=False
elif LINE.startswith('s') and not QUERY_PASSED:
   MUERY, MSTART, MSIZE, MSTRAND, MSRCSIZE, MSEQUENCE = LINE.split()[1:]
   MEND = str(int(MSTART) + int(MSIZE))
   MLEFT = str(int(MSRCSIZE) - int(MEND))
   MUERY_PASSED = True
elif LINE.startswith('s') and QUERY_PASSED:
   QATCH, QSTART, QSIZE, QSTRAND, QSRCSIZE, QSEQUENCE = LINE.split()[1:]
   QEND = str(int(QSTART) + int(QSIZE))
   QLEFT = str(int(QSRCSIZE) - int(QEND))
   QATCH_PASSED=True
   ALNS.append((SCORE, 'nA', 'nA', 'nA', QUERY, QSTART, QEND, QLEFT, MSTRAND, MATCH, 'nA', MSTART, MEND, MLEFT, str(i)))

INFILE.close()

if WITH_HEADER:
print("SW      subst   del_in_retro    insert_in_retro query   qstart  qend    qleft   compl   match   match_class     mstart  mend    mleft   id")    
for LINE in ALNS:
print('\t'.join(LINE))
 

2013-02-18

Use Imagemagick to swap blue and red colors

For a presentation I wanted an image to appear in magenta, rather than in cyan. This is how the image originally looks like:




blue.jgp

With the following command I separated the R, G, B color channels and swapped the Red and the Green channel, to produce a picture in which all blue colors appear in red:

$ convert blue.jpg -separate -swap 0,1 -combine red.jpg

 And that is how the result looks like:




red.jpg


I got this information from: http://www.imagemagick.org/Usage/color_basics/#combine

2013-01-30

RepeatMasker ambiguous bases

When trying to use a chromosome fasta file with RepeatMasker, one might gets the error message that "ambiguous bases" were detected because the fasta file contains long stretches of N's.
In order to avoid this problem, you can replace the "N" characters by "X"'s before starting RepeatMasker.

2012-11-13

How to split a string in R using strsplit

Since the first google and Rseek.org results did not help me with my problem of splitting a string in R, I want to clip it here, so I do not have to click through all of these pages again :)

The command is as easy as this:


> y <- "100,200,300"
> x <- strsplit(y, ',')
> x
[[1]]
[1] "100" "200" "300"

2012-10-31

R: Function to perform t-test on all possible combinations of columns in a dataframe



Sometimes I need to perform a t-test on all possible combinations of the columns of a data-frame and rather than computing them all one by one, I wrote this small function:


ttest_all_possible  <- function(dataframe) {
    nr_columns = dim(dataframe)[2]
    results <- matrix(NA, nr_columns, nr_columns)
    for (i in seq(1, nr_columns)){
        for (j in seq(1, nr_columns)){
            results[i, j] <- t.test(dataframe[,i], dataframe[,j])$p.value
        }
    }
    return(results)
}


Works fine:
> test <- cbind(rnorm(10),rnorm(10), rnorm(10))
> test
             [,1]       [,2]       [,3]
 [1,]  0.64061950 -0.5643179  0.7450485
 [2,]  1.90703328 -0.8000197  0.2444611
 [3,]  0.96346636 -0.6278075  0.1283072
 [4,] -0.25758024 -0.9657888 -0.8832595
 [5,]  0.36934381  1.4149775  0.8837320
 [6,]  0.60584130  0.2556500 -0.2437533
 [7,] -0.24559562 -0.7774171  0.1279253
 [8,] -1.53276425  0.3557177  0.1611155
 [9,] -1.55474156  0.3488891  0.1025763
[10,] -0.08366046 -0.2823561  1.3618903
> ttest_all_possible(test)
          [,1]      [,2]      [,3]
[1,] 1.0000000 0.5603004 0.6495045
[2,] 0.5603004 1.0000000 0.1817692
[3,] 0.6495045 0.1817692 1.0000000

2012-10-30

How to substutitute characters in all files of a directory using regex and perl

perl -i -pe 's/CharacterToSubstiture/SubstituteWithThis/g' myfilename

Example:
Imagine I have got these two files:

shopping_list_groceries.txt:
- 2 apples
- 3 red paprika

shopping_list_clothes.txt:
- 2 pairs of socks
- 1 red t-shirt

And now I decide to not like red any more, but to prefer green:

perl -i -pe 's/red/green/g' shopping_list*txt


The text files now look like this:
shopping_list_groceries.txt:
- 2 apples
- 3 green paprika

shopping_list_clothes.txt:
- 2 pairs of socks
- 1 green t-shirt


What does the one-liner do?
perl # use perl :)
-i # change in the file, if you skip this part, the changes will be printed to stdout
-pe # I don't know
's/red/green/g' # RegEx: 
                 's # Substitute...
                 /red # ...every occurence of the string "red"...
                 /green # ...with the string "green"...
                 /g' # ... everywhere ...
shopping_list*.txt # ...in all files starting with 'shopping_list' and ending with '.txt



2012-10-02

How to print all the attributes of an object in python

from pprint import pprint
pprint (vars(my_object))     # e.g. class instance
 
 
This prints all the attributes/properties this object currently has in a neat way.

2012-09-19

How to create a tar archive

tar cvzf archive.tar.gz *files*

2012-09-18

Bash Shell: Check if a file exists, and has a size larger than 0

To check, whether a file exists and has a file size larger than 0 use this expression:



if [ -s someFile.txt ]
then
  ...
fi

Bash Shell: Check if a file exists

To check, whether a file exists use this expression:

First, create the file thisFileExists.txt:
touch thisFileExists.txt

Now, check if it exists:
if [ -f thisFileExists.txt ]
then
  echo It is there
fi

This is what it looks like in terminal:

Bash Shell: Repeat a command every x seconds, e.g. to update list of files

When running a command that changes/creates files and takes a while, one might be interested in seeing how progress is going.
For this the "watch" command can be useful:


watch -n 2 ls *file

This will print the output of "ls *file" every 2 seconds.


watch -n 2 -d ls *file

Adding -d will highlight all the changes in comparison to the last time the command was performed

2012-07-28

Creating an animated gif image

Imagemagick can also be used to create animations!

convert -delay 10 -loop 0 DSC003* animation.gif

Where DSC003* are the files that are going to be used for the animations. They can be in about any format.

As a (rather stupid) example, I created 16 barplots in R with different colors and animated this:

in R:
$ pdf('example.pdf')
$ for (i in seq(5,20)){barplot(1:20,col=rainbow(i))}
$ dev.off()

in Shell:
$ convert example.pdf example.png
$ convert -delay 10 -loop 1 example-*png animation.gif


The 16 different barplots.
The slightly psychedelic animation.