-----------------------------------------------------------------------
ISAR: Isoform Structure Alignment Representation
ISAR v.1.0 (23-02-2016)
https://www.bio.ifi.lmu.de/softwareservices/isar
Robert Pesch, Gergely Csaba and Ralf Zimmer
{pesch, csaba, zimmer}@bio.ifi.lmu.de
-----------------------------------------------------------------------

1.) COMPUTE ISARs

ISAR uses oracles to compute isoform-consistent alignments. The default
pairwise sequence alignment oracle is shipped with ISAR. Additionally,
the PRRN multiple sequence aligner and the SPALN2 spliced aligner can be
integrated, and pre-computed alignments can be provided for the ISAR
computation.

To use PRRN or SPALN2, download and install the executables from:
http://www.genome.ist.i.kyoto-u.ac.jp/~aln_user/prrn/
and
http://www.genome.ist.i.kyoto-u.ac.jp/~aln_user/spaln/

ISAR can be started with:
java -jar ISAR-1.0.jar

Note: for large ISARs the memory limit may need to be increased (with
-Xmx) in order to avoid out-of-memory exceptions, e.g.
java -Xmx4G -jar ISAR-1.0.jar

Examples:

Compute an ISAR for three genes (two paralogous human genes and one
orthologous yeast gene) with 9 sequences in total (inputs are provided
in the example directory):
java -jar ISAR-1.0.jar -in examples/RAB1A.fasta,examples/RAB1B.fasta,examples/YPT1.fasta -out RAB1 -fasta -exon_view

Compute an ISAR with additional spliced-aligner and multiple sequence
aligner oracles:
java -jar ISAR-1.0.jar -in examples/RAB1A.fasta,examples/RAB1B.fasta,examples/YPT1.fasta -out RAB1 -fasta -exon_view -spaln2 tools/spaln2/bin/spaln -prrn tools/prrn/bin/prrn

The default output of ISAR is an xml.gz file which includes the gene
sequences, the protein sequences and the aligned regions. With the
-fasta and -exon_view parameters, a FASTA multiple sequence alignment
and a graphical representation of the alignment are created as well.
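
When many ISARs are computed, the command lines above can be assembled
by a small wrapper. The following Python sketch is not part of ISAR; it
only uses the options shown in the examples above (-in, -out, -fasta,
-exon_view, -spaln2, -prrn and the JVM option -Xmx). The function name
build_isar_command and its parameter names are made up for this
illustration.

```python
import shlex


def build_isar_command(inputs, out, jar="ISAR-1.0.jar", max_heap=None,
                       fasta=False, exon_view=False, spaln2=None, prrn=None):
    """Assemble the argument list for one ISAR run (hypothetical helper)."""
    if not inputs:
        raise ValueError("at least one input file is required")
    cmd = ["java"]
    if max_heap:
        # JVM heap limit, e.g. "4G" becomes -Xmx4G.
        cmd.append(f"-Xmx{max_heap}")
    # ISAR expects the input files as one comma-separated argument.
    cmd += ["-jar", jar, "-in", ",".join(inputs), "-out", out]
    if fasta:
        cmd.append("-fasta")
    if exon_view:
        cmd.append("-exon_view")
    if spaln2:
        cmd += ["-spaln2", spaln2]
    if prrn:
        cmd += ["-prrn", prrn]
    return cmd


if __name__ == "__main__":
    cmd = build_isar_command(
        ["examples/RAB1A.fasta", "examples/RAB1B.fasta", "examples/YPT1.fasta"],
        "RAB1", max_heap="4G", fasta=True, exon_view=True)
    # Print the command; pass the list to subprocess.run(cmd, check=True)
    # to actually start ISAR.
    print(shlex.join(cmd))
```

Returning a list instead of a single string avoids quoting problems
when file names contain spaces.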

Integrate pre-computed pairwise and multiple alignments between
(subsets of) the isoforms into the ISAR construction:
java -jar ISAR-1.0.jar -in examples/RAB1A.fasta,examples/RAB1B.fasta,examples/YPT1.fasta -alignment examples/clustal/rab.aln -out RAB1

Combine different ISARs:
java -jar ISAR-1.0.jar -in examples/RAB1A.fasta,examples/RAB1B.fasta -out RAB1AB
java -jar ISAR-1.0.jar -in examples/RAB1A.fasta,examples/YPT1.fasta -out RAB1
java -jar ISAR-1.0.jar -in RAB1,RAB1AB -out RAB_comb

Compute ISAR for the RAB-Family:
see rab1a_isar.sh in examples
see polr3h_isar.sh in examples

2.) ISAR INPUT

The ISAR input consists of gene sequences with protein annotations.
Each gene must be defined in a separate FASTA file. We provide scripts
to generate the required input files for genes of interest (see 2b and
2c).

2a. ISAR INPUT FORMAT

An ISAR gene FASTA file starts with the gene identifier. Additionally,
the gene name, the taxonomy id and the genomic location can be provided
(space separated), e.g.
>>ENSG00000138069 RAB1A 9606 2:65297835-65357240:0

The first sequence must be the complete sequence of the gene, e.g.
ATTTTGGGTGGAAGCGATAGCTGAGTGGCGGCGGCTGCTGATTGTGTTCTAGGGGACGGAGTAGGGGAAGACGTTTGCTC
TCCCGGAACAGCCTATCTCATTCCTTTCTTTCGATTACCCGTGGCGCGGAGAGTCAGGGCGGCGGCTGCGGCAGCAAGGG
CGGCGGTGGCGGCGGCGGCAGCTGCAGTGACATGTCCAGCATGAATCCCGAATAGTGAGTTCAGGAGAGCACCGGTCGGC
TGGGTCCGTGGGCCAGCTTGGGGGATCTTAAAGGGGTCGAGGAGGGTTGGGGCAGAAGTCGGGGCATCGGCTGGGGTGAG
GCGAGGGTGATGGGTCAGGAGAGGCTGGCGGCCGGGAGTCGGGCCCCATTGTCTGACGCGGAGGGGCGGCCGCGCGGGGG
AGGGGTCGGGCCGGAGGGGTGAGCCGCCCGGGCCTGGACCGGGTCAGGTTAGAGGGCCTGACTGCGGGGCGGGTGCTGAG
GAAGCCTGCCGAGGGGCCTGGGGCGGTGTGAAGGGGTATCTTCTCTCGGAGGCAGTGACTTTTGAAGGAGGACTTGTCTC
...
TGGGTTTTGGAGCCAGACAAACCTGAGGCCAGCCCCTAACTTTGCCACCTTCCAGATGTATGACTTCGAGCCTCACTTTC
CTCCCCTCATAGCACCAGCTCCTCAGAAGGTAAGGATTAAATACACCCATGTGTGCAAAGCACATGGTAAGAGCTTAACG
AACAACAGGAATTCTAATAATAAGCACCGGACTCATCCCCGTGAAA

After that, the protein annotations should be provided. A protein
annotation starts with an identifier, e.g.
>ENSP00000386672

After the protein identifier, the exon annotation should be provided.
An exon annotation consists of a start position (inclusive) and an end
position (exclusive). The positions are relative to the provided gene
sequence. Each exon must be defined on a separate line, e.g. (five
exons):
;191 214
;25300 25373
;39028 39124
;41036 41168
;41416 41611

Finally, the amino-acid or DNA sequence for the protein can be
provided, e.g.
ATGTCCAGCATGAATCCCGAATATGATTATTTATTCAAGTTACTTCTGATTGGCGACTCAGGGGTTGGAAAGTCTTGCCT
TCTTCTTAGGTTTGCATGGGACACAGCAGGCCAGGAAAGATTTCGAACAATCACCTCCAGTTATTACAGAGGAGCCCATG
GCATCATAGTTGTGTATGATGTGACAGATCAGGAGTCCTTCAATAATGTTAAACAGTGGCTGCAGGAAATAGATCGTTAT
GCCAGTGAAAATGTCAACAAATTGTTGGTAGGGAACAAATGTGATCTGACCACAAAGAAAGTAGTAGACTACACAACAGC
GAAGGAATTTGCTGATTCCCTTGGAATTCCGTTTTTGGAAACCAGTGCTAAGAATGCAACGAATGTAGAACAGTCTTTCA
TGACGATGGCAGCTGAGATTAAAAAGCGAATGGGTCCCGGAGCAACAGCTGGTGGTGCTGAGAAGTCCAATGTTAAAATT
CAGAGCACTCCAGTCAAGCAGTCAGGTGGAGGTTGCTGC

The sequence is used to check the exon annotation.

Note: ISAR is able to infer the exon annotation using the spaln2
spliced-aligner (the -spaln2 option needs to be set). Thus, it is also
possible to provide only the amino-acid or DNA sequence for the
proteins of interest. See examples/RAB1A_missing.fasta for an example.

2b. ENSEMBL PERL API

We provide a Perl script to create the necessary ISAR input information
from the ENSEMBL database for a given gene identifier (an internet
connection is required). The script uses BioPerl and the ENSEMBL Perl
API. See http://www.ensembl.org/info/docs/api/api_installation.html for
installation instructions.
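
Whether a gene file is written by hand (2a) or generated by a script,
it can be sanity-checked before it is passed to ISAR. The following
Python sketch is not part of ISAR and does not reproduce ISAR's own
check. It assumes that the first record is the gene and all later
records are proteins, that header lines start with one or more ">"
characters, and that positions are 0-based. The last assumption matches
the RAB1A example, where ";191 214" selects the 23 bases
ATGTCCAGCATGAATCCCGAATA. The function names and the toy record G1/P1
are made up for this illustration.

```python
DNA_LETTERS = set("ACGTN")


def parse_isar_fasta(text):
    """Return (gene, proteins) parsed from the text of one ISAR gene file."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: identifier first, optional fields space separated.
            fields = line.lstrip(">").split()
            if not fields:
                raise ValueError("header without identifier")
            records.append({"id": fields[0], "info": fields[1:],
                            "exons": [], "seq": ""})
        elif not records:
            raise ValueError("content before the first header")
        elif line.startswith(";"):
            # Exon line: ";start end", start inclusive, end exclusive.
            start, end = line[1:].split()
            records[-1]["exons"].append((int(start), int(end)))
        else:
            records[-1]["seq"] += line.upper()
    if not records:
        raise ValueError("no records found")
    # Assumption: the first record is the gene, the rest are proteins.
    return records[0], records[1:]


def exon_problems(gene_seq, protein):
    """List inconsistencies between a protein's exons and the gene sequence."""
    problems = []
    for start, end in protein["exons"]:
        if not 0 <= start < end <= len(gene_seq):
            problems.append(f"exon {start} {end} is outside the gene sequence")
    seq = protein["seq"]
    # Only a DNA sequence can be compared base by base with the exons.
    if not problems and protein["exons"] and seq and set(seq) <= DNA_LETTERS:
        spliced = "".join(gene_seq[s:e] for s, e in protein["exons"])
        if spliced != seq:
            problems.append("spliced exons differ from the given DNA sequence")
    return problems


if __name__ == "__main__":
    # Toy input (hypothetical gene G1 with one protein P1).
    example = """>>G1 NAME 9606
AAAATGGT
CCCTGATT
>P1
;3 6
;9 13
ATGCCTG
"""
    gene, proteins = parse_isar_fasta(example)
    for protein in proteins:
        print(protein["id"], exon_problems(gene["seq"], protein) or "ok")
```

Proteins given as amino-acid sequences are only checked for exon
bounds, since comparing them would require a translation step.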

To start the Perl script:
perl scripts/seq_ensembl_dl.pl

The following non-optional parameter(s) are missing:
-s ENSEMBL species name e.g. homo_sapiens, saccharomyces_cerevisiae (default: homo_sapiens)
-g ENSEMBL gene id e.g. ENSG00000138069, YPT1

See scripts/seq_ensembl_species for possible species names (currently
66 species).

Examples:
perl scripts/seq_ensembl_dl.pl -g ENSG00000138069 > RAB1A_human.fasta
perl scripts/seq_ensembl_dl.pl -g ENSG00000174903 > RAB1B_human.fasta
perl scripts/seq_ensembl_dl.pl -g ENSMUSG00000020149 -s mus_musculus > RAB1A_mouse.fasta
perl scripts/seq_ensembl_dl.pl -g ENSDARG00000029663 -s danio_rerio > RAB1A_fish.fasta
perl scripts/seq_ensembl_dl.pl -g YPT1 -s saccharomyces_cerevisiae > YPT1_yeast.fasta

2c. GENOME AND GTF files

For batch processing it is advisable to download the entire genome and
the GTF (gene transfer format) files from ENSEMBL and to use the
provided scripts to create the ISAR inputs.

Genomes and GTFs can be downloaded from:
ftp://ftp.ensembl.org/pub/release-81/fasta/
ftp://ftp.ensembl.org/pub/release-81/gtf/

Further genomes for fungi, bacteria, metazoa, plants and protists can
be downloaded from:
ftp://ftp.ensemblgenomes.org/pub/release-28/

To start the script, execute:
java -cp ISAR-1.0.jar isoformtransfer.isar.tools.InputGenerator

for cmdline the following non-optional parameters are not set:
-gtf -gene_ids -fasta_dir -out_dir
parameters for cmdline:
-gtf -gene_ids -fasta_dir [-taxId >] -filter_identical -out_dir

Examples:

Download and extract yeast genome and GTF:
mkdir yeast
cd yeast
wget ftp://ftp.ensembl.org/pub/release-81/fasta/saccharomyces_cerevisiae/dna/*.dna.chromosome.*
wget ftp://ftp.ensembl.org/pub/release-81/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.81.gtf.gz
gunzip *.gz
cd ..

Download and extract fly genome and GTF:
mkdir fly
cd fly
wget ftp://ftp.ensembl.org/pub/release-81/fasta/drosophila_melanogaster/dna/*.dna.chromosome.*
wget ftp://ftp.ensembl.org/pub/release-81/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.81.gtf.gz
gunzip *.gz
cd ..

Create the ISAR inputs for fly and yeast genes:
java -cp ISAR-1.0.jar isoformtransfer.isar.tools.InputGenerator -gtf fly/Drosophila_melanogaster.BDGP6.81.gtf -fasta_dir fly -gene_ids FBgn0031090,FBgn0031882 -out_dir . -taxId 7227
java -cp ISAR-1.0.jar isoformtransfer.isar.tools.InputGenerator -gtf yeast/Saccharomyces_cerevisiae.R64-1-1.81.gtf -fasta_dir yeast/ -gene_ids YFL038C -out_dir . -taxId 4932

For each given gene ID, the script creates a FASTA file that can be
used as an input file for ISAR.

3.) PROCESS ISARs

FASTA and graphical representations of an ISAR can be produced from the
computed xml.gz files.

ISAR requires the following non-optional parameters:
-out_file -isar -out_type

INPUT/OUTPUT:
-isar FILE input ISAR xml.gz file
-out_file FILE output file
-out_type STR output type
    Options: report_unmatched, png_simple, png_event, png, isar_mul_ali, fasta

PNG OPTIONS:
[-png_scale_factor =1 Scale for output visualization]
[-intron_size =30 size of introns in png]
-show_isoform_labels

GENERAL:
[-org_xml organism configuration file]
[-genes STR comma separated list of genes to be included]

Examples for RAB1 ISAR:

Report unmatched regions:
java -cp ISAR-1.0.jar isoformtransfer.isar.Formater -isar RAB1 -out_type report_unmatched -out_file RAB1.unmatched

Draw Hasse diagram:
java -cp ISAR-1.0.jar isoformtransfer.isar.Formater -isar RAB1 -out_type png_simple -out_file RAB1.png

Create ordinary fasta output:
java -cp ISAR-1.0.jar isoformtransfer.isar.Formater -isar RAB1 -out_type fasta -out_file RAB1.fasta

Create exon-view:
java -cp ISAR-1.0.jar isoformtransfer.isar.Formater -isar RAB1 -out_type png -out_file RAB1.png
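
To produce several representations of one ISAR in a single pass, the
Formater calls above can be generated in a loop. The following Python
sketch is not part of ISAR. It uses only the documented parameters
-isar, -out_type and -out_file and the listed output types. The helper
names and the output file naming scheme are made up for this
illustration; the scheme gives every output type its own file, so that
the two PNG variants do not overwrite each other.

```python
import shlex

# Output types as listed for the -out_type parameter.
OUT_TYPES = ("report_unmatched", "png_simple", "png_event", "png",
             "isar_mul_ali", "fasta")


def default_out_file(prefix, out_type):
    """Hypothetical naming scheme: one distinct file per output type."""
    if out_type.startswith("png"):
        ext = "png"
    elif out_type == "fasta":
        ext = "fasta"
    else:
        ext = "txt"
    return f"{prefix}_{out_type}.{ext}"


def formater_command(isar, out_type, out_file, jar="ISAR-1.0.jar"):
    """Assemble the argument list for one Formater call."""
    if out_type not in OUT_TYPES:
        raise ValueError(f"unknown output type: {out_type}")
    return ["java", "-cp", jar, "isoformtransfer.isar.Formater",
            "-isar", isar, "-out_type", out_type, "-out_file", out_file]


def all_formater_commands(isar, prefix, out_types=OUT_TYPES):
    """One command per requested output type."""
    return [formater_command(isar, t, default_out_file(prefix, t))
            for t in out_types]


if __name__ == "__main__":
    for cmd in all_formater_commands("RAB1", "RAB1"):
        # Print only; pass cmd to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```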