2014-04-10

Convert UCSC Aln

The UCSC 100-way genome alignment, as downloaded from the UCSC website,
is in a strange format, hence:
1. Delete all lines that do not belong to the alignment
2. Run a regex that inserts a tab between the species name and the sequence, but not between the words of a multi-word species name
perl -pe "s/(([a-zA-Z-\(\)'\.]+\s){1,})([Nacgt=-]+)/\1\t\3/g"
3. Upload the file to Galaxy
4. Run the Tabular-to-FASTA converter
5. Concatenate the FASTA by species
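
For anyone who wants to check the result outside Galaxy, here is a rough Python sketch of what steps 1, 2, 4 and 5 amount to. It is not what Galaxy runs internally. The pattern is a Python version of the Perl one above; the function names (split_row, concatenate_by_species, to_fasta) and the example rows are made up for illustration, and it assumes that every line which does not look like "species name, whitespace, sequence" can simply be skipped.

```python
import re

# Python counterpart of the Perl pattern: one or more name words, each followed
# by whitespace, then the aligned sequence (N, a, c, g, t, '=' or '-').
ROW = re.compile(r"((?:[a-zA-Z()'.-]+\s)+)([Nacgt=-]+)")


def split_row(line):
    """Return (species, sequence) for an alignment row, or None otherwise."""
    match = ROW.fullmatch(line.strip())
    if match is None:
        return None
    return match.group(1).strip(), match.group(2)


def concatenate_by_species(lines):
    """Join the sequence chunks of each species, in first-seen order."""
    chunks = {}
    for line in lines:
        row = split_row(line)
        if row is None:
            continue  # assumption: anything that is not a row can be skipped
        species, sequence = row
        chunks.setdefault(species, []).append(sequence)
    return {species: "".join(parts) for species, parts in chunks.items()}


def to_fasta(sequences):
    """Render a {species: sequence} mapping as FASTA text."""
    return "".join(">%s\n%s\n" % (name, seq) for name, seq in sequences.items())


if __name__ == "__main__":
    # Made-up example rows, not real UCSC data.
    example = [
        "Alignment block 1 of 2",
        "Human acgt--acgt",
        "Chinese hamster acgtNNacgt",
        "",
        "Human ttga",
        "Chinese hamster tt=a",
    ]
    print(to_fasta(concatenate_by_species(example)), end="")
```

Run on the example rows, this prints one FASTA record per species, with the chunks from both blocks joined and the multi-word name "Chinese hamster" kept in one piece.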

Done!
