-----------------------------------------------------------------------
  ISAR: Isoform Structure Alignment Representation
  ISAR v.1.0 (23-02-2016)

  https://www.bio.ifi.lmu.de/softwareservices/isar
  Robert Pesch, Gergely Csaba and Ralf Zimmer 
  {pesch, csaba, zimmer}@bio.ifi.lmu.de
-----------------------------------------------------------------------

1.) COMPUTE ISARs
  
  ISAR uses oracles to compute isoform-consistent alignments. The default pairwise sequence alignment oracle is shipped with ISAR.
  
  Additionally, the PRRN multiple sequence aligner and the SPALN2 spliced aligner can be integrated. Pre-computed alignments can also be provided for the ISAR computation.
  Download and install the PRRN and SPALN2 executables from:
  http://www.genome.ist.i.kyoto-u.ac.jp/~aln_user/prrn/
  and
  http://www.genome.ist.i.kyoto-u.ac.jp/~aln_user/spaln/

  ISAR can be started with:
       
  >java -jar ISAR-1.0.jar 

  Note: for large ISARs the Java heap size may need to be increased (with -Xmx<Size>) to avoid out-of-memory exceptions, e.g.
  java -Xmx4G -jar ISAR-1.0.jar
  
  Examples:
  
  Compute an ISAR for three genes (two paralogous human genes and one orthologous yeast gene) with 9 sequences in total
  (inputs are provided in the examples directory):

  java -jar ISAR-1.0.jar -in examples/RAB1A.fasta,examples/RAB1B.fasta,examples/YPT1.fasta -out RAB1 -fasta -exon_view

  Compute an ISAR with additional spliced-aligner and multiple sequence aligner oracles:

  java -jar ISAR-1.0.jar -in examples/RAB1A.fasta,examples/RAB1B.fasta,examples/YPT1.fasta -out RAB1 -fasta -exon_view -spaln2 tools/spaln2/bin/spaln -prrn tools/prrn/bin/prrn
    
  The default output of ISAR is an xml.gz file that includes the gene sequences, the protein sequences and the aligned regions.
  With the -fasta and -exon_view parameters, a FASTA multiple sequence alignment and a graphical representation of the alignment are created as well.
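
  For scripted runs, the invocations above can be assembled programmatically. The following Python sketch only
  builds the argument list from options that appear in this README (-Xmx, -in, -out, -fasta, -exon_view). The
  function name isar_command and its parameters are hypothetical helpers for illustration and not part of ISAR.

```python
def isar_command(inputs, out, jar="ISAR-1.0.jar", max_heap=None,
                 fasta=False, exon_view=False):
    """Build the argument list for an ISAR run (hypothetical helper)."""
    if not inputs:
        raise ValueError("at least one input file is required")
    cmd = ["java"]
    if max_heap:                        # e.g. "4G" becomes -Xmx4G
        cmd.append("-Xmx" + max_heap)
    # ISAR expects the input files as one comma-separated list after -in
    cmd += ["-jar", jar, "-in", ",".join(inputs), "-out", out]
    if fasta:
        cmd.append("-fasta")
    if exon_view:
        cmd.append("-exon_view")
    return cmd


if __name__ == "__main__":
    cmd = isar_command(
        ["examples/RAB1A.fasta", "examples/RAB1B.fasta", "examples/YPT1.fasta"],
        out="RAB1", max_heap="4G", fasta=True, exon_view=True)
    print(" ".join(cmd))
    # To execute it: import subprocess; subprocess.run(cmd, check=True)
```

  Passing the list to subprocess.run (instead of a single shell string) avoids quoting problems with file names.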

  Integrate pre-computed pairwise and multiple alignments between (subsets of) the isoforms into the ISAR construction:
  
  java -jar ISAR-1.0.jar -in examples/RAB1A.fasta,examples/RAB1B.fasta,examples/YPT1.fasta -alignment examples/clustal/rab.aln -out RAB1
   
  Combine different ISARs:

  java -jar ISAR-1.0.jar -in examples/RAB1A.fasta,examples/RAB1B.fasta -out RAB1AB
  java -jar ISAR-1.0.jar -in examples/RAB1A.fasta,examples/YPT1.fasta -out RAB1
  
  java -jar ISAR-1.0.jar -in RAB1,RAB1AB -out RAB_comb

  Compute ISAR for the RAB-Family:
  see rab1a_isar.sh in examples
  see polr3h_isar.sh in examples

2.) ISAR INPUT
  The ISAR input consists of gene sequences with protein annotations. Each gene must be defined in a separate FASTA file. 
  We provide scripts to generate the required input files for genes of interest (see 2b and 2c).

2a. ISAR INPUT FORMAT
  
  An ISAR gene FASTA file starts with a header line consisting of '>>' followed by the gene identifier. Additionally, the gene name, the taxonomy id and the genomic location can be provided
  (space-separated), e.g.
   >>ENSG00000138069 RAB1A 9606 2:65297835-65357240:0

  The first sequence must be the entire sequence of the gene, e.g.
  ATTTTGGGTGGAAGCGATAGCTGAGTGGCGGCGGCTGCTGATTGTGTTCTAGGGGACGGAGTAGGGGAAGACGTTTGCTC
  TCCCGGAACAGCCTATCTCATTCCTTTCTTTCGATTACCCGTGGCGCGGAGAGTCAGGGCGGCGGCTGCGGCAGCAAGGG
  CGGCGGTGGCGGCGGCGGCAGCTGCAGTGACATGTCCAGCATGAATCCCGAATAGTGAGTTCAGGAGAGCACCGGTCGGC
  TGGGTCCGTGGGCCAGCTTGGGGGATCTTAAAGGGGTCGAGGAGGGTTGGGGCAGAAGTCGGGGCATCGGCTGGGGTGAG
  GCGAGGGTGATGGGTCAGGAGAGGCTGGCGGCCGGGAGTCGGGCCCCATTGTCTGACGCGGAGGGGCGGCCGCGCGGGGG
  AGGGGTCGGGCCGGAGGGGTGAGCCGCCCGGGCCTGGACCGGGTCAGGTTAGAGGGCCTGACTGCGGGGCGGGTGCTGAG
  GAAGCCTGCCGAGGGGCCTGGGGCGGTGTGAAGGGGTATCTTCTCTCGGAGGCAGTGACTTTTGAAGGAGGACTTGTCTC
  ... 
  TGGGTTTTGGAGCCAGACAAACCTGAGGCCAGCCCCTAACTTTGCCACCTTCCAGATGTATGACTTCGAGCCTCACTTTC
  CTCCCCTCATAGCACCAGCTCCTCAGAAGGTAAGGATTAAATACACCCATGTGTGCAAAGCACATGGTAAGAGCTTAACG
  AACAACAGGAATTCTAATAATAAGCACCGGACTCATCCCCGTGAAA

  After that, the protein annotations should be provided. A protein annotation starts with a '>' line containing the protein identifier, e.g.
  >ENSP00000386672 

  The protein identifier should be followed by the exon annotation. An exon annotation consists of a start position (inclusive) and
  an end position (exclusive). The positions are relative to the provided gene sequence.
  Each exon must be defined on a separate line starting with ';', e.g. (five exons):
  ;191 214
  ;25300 25373
  ;39028 39124
  ;41036 41168
  ;41416 41611     

  Finally, the amino-acid or DNA sequence of the protein can be provided, e.g.
  ATGTCCAGCATGAATCCCGAATATGATTATTTATTCAAGTTACTTCTGATTGGCGACTCAGGGGTTGGAAAGTCTTGCCT
  TCTTCTTAGGTTTGCATGGGACACAGCAGGCCAGGAAAGATTTCGAACAATCACCTCCAGTTATTACAGAGGAGCCCATG
  GCATCATAGTTGTGTATGATGTGACAGATCAGGAGTCCTTCAATAATGTTAAACAGTGGCTGCAGGAAATAGATCGTTAT
  GCCAGTGAAAATGTCAACAAATTGTTGGTAGGGAACAAATGTGATCTGACCACAAAGAAAGTAGTAGACTACACAACAGC
  GAAGGAATTTGCTGATTCCCTTGGAATTCCGTTTTTGGAAACCAGTGCTAAGAATGCAACGAATGTAGAACAGTCTTTCA
  TGACGATGGCAGCTGAGATTAAAAAGCGAATGGGTCCCGGAGCAACAGCTGGTGGTGCTGAGAAGTCCAATGTTAAAATT
  CAGAGCACTCCAGTCAAGCAGTCAGGTGGAGGTTGCTGC    
  The sequence is used to check the exon annotation.
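
  Such a check can be pictured as follows: the exon intervals are cut out of the gene sequence, concatenated, and
  compared with the provided DNA sequence. This is only a sketch and not ISAR's actual implementation. It assumes
  0-based positions (the RAB1A example above appears consistent with this, since position 191 is the first base of
  the ATG start codon in the shown gene sequence), a gene on the forward strand, and a DNA rather than an amino-acid
  sequence. The names spliced_sequence and exons_match are made up for illustration.

```python
def spliced_sequence(gene_seq, exons):
    """Concatenate the exon intervals [start, end) cut out of gene_seq."""
    parts = []
    for start, end in exons:
        if not 0 <= start < end <= len(gene_seq):
            raise ValueError(f"exon {start} {end} lies outside the gene sequence")
        parts.append(gene_seq[start:end])   # end is exclusive, like a Python slice
    return "".join(parts)


def exons_match(gene_seq, exons, coding_seq):
    """True if the spliced exons reproduce the provided DNA sequence."""
    return spliced_sequence(gene_seq, exons).upper() == coding_seq.upper()


# Toy data (not a real gene): two exons in upper case, separated by an intron.
gene = "ccATGGCAgtaagtttcagGTTTAAcc"
exons = [(2, 8), (19, 25)]
print(spliced_sequence(gene, exons))              # ATGGCAGTTTAA
print(exons_match(gene, exons, "ATGGCAGTTTAA"))   # True
```

  With half-open intervals the length of an exon is simply end - start, and adjacent exons can be concatenated
  without off-by-one corrections.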

  Note: ISAR is able to infer exon annotation using the spaln2 spliced-aligner (the -spaln2 option needs to be set). Thus, it is also possible to just
  provide the amino-acid or DNA sequence for the proteins of interest. 

  See examples/RAB1A_missing.fasta for an example.
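
  To make the layout described in this section concrete, the following Python sketch reads one gene file into a
  dictionary. It is a minimal reader written from the description above, not the parser used by ISAR. The function
  name parse_isar_fasta and the dictionary keys are hypothetical. Proteins without ';' lines (as in the note above)
  simply end up with an empty exon list.

```python
def parse_isar_fasta(lines):
    """Parse one ISAR gene FASTA file given as an iterable of text lines."""
    gene = None
    protein = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">>"):                  # gene header
            fields = line[2:].split()
            if not fields:
                raise ValueError("gene header without identifier")
            # fields[1:] holds the optional name, taxonomy id and location
            gene = {"id": fields[0], "info": fields[1:],
                    "sequence": "", "proteins": []}
        elif gene is None:
            raise ValueError("file must start with a '>>' gene header")
        elif line.startswith(">"):                 # protein header
            protein = {"id": line[1:].strip(), "exons": [], "sequence": ""}
            gene["proteins"].append(protein)
        elif line.startswith(";"):                 # exon: start (incl.) end (excl.)
            if protein is None:
                raise ValueError("exon line before any protein header")
            start, end = map(int, line[1:].split())
            protein["exons"].append((start, end))
        elif protein is None:
            gene["sequence"] += line               # still inside the gene sequence
        else:
            protein["sequence"] += line            # amino-acid or DNA sequence
    if gene is None:
        raise ValueError("no gene header found")
    return gene


# Toy input (not a real gene) with one annotated and one unannotated protein.
example = """\
>>GENE1 TOY 9606
ccATGGCAgtaagtttcagGTTTAAcc
>PROT1
;2 8
;19 25
ATGGCAGTTTAA
>PROT2
ATGGCA
""".splitlines()

gene = parse_isar_fasta(example)
print(gene["id"], [p["id"] for p in gene["proteins"]])   # GENE1 ['PROT1', 'PROT2']
print(gene["proteins"][0]["exons"])                      # [(2, 8), (19, 25)]
```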

2b. ENSEMBL PERL API
  We provide a PERL script that creates the necessary ISAR input from the ENSEMBL database for a given gene identifier
  (an internet connection is required).

  The script uses the BioPERL and the ENSEMBL PERL API. 
  See http://www.ensembl.org/info/docs/api/api_installation.html for installation instructions.  
    
  To start the script:

  perl scripts/seq_ensembl_dl.pl

  When called without arguments, the script reports the missing parameters:

		-s ENSEMBL species name    e.g. homo_sapiens, saccharomyces_cerevisiae (default: homo_sapiens) 
		-g ENSEMBL gene id         e.g. ENSG00000138069, YPT1

  See scripts/seq_ensembl_species for the possible species names (currently 66 species).

  Examples:

  perl scripts/seq_ensembl_dl.pl -g ENSG00000138069 > RAB1A_human.fasta
  perl scripts/seq_ensembl_dl.pl -g ENSG00000174903 > RAB1B_human.fasta
  perl scripts/seq_ensembl_dl.pl -g ENSMUSG00000020149 -s mus_musculus > RAB1A_mouse.fasta
  perl scripts/seq_ensembl_dl.pl -g ENSDARG00000029663 -s danio_rerio > RAB1A_fish.fasta
  perl scripts/seq_ensembl_dl.pl -g YPT1 -s saccharomyces_cerevisiae > YPT1_yeast.fasta


2c. GENOME AND GTF files
  For batch processing it is advisable to download the entire genome and the GTF (gene transfer format) files from ENSEMBL and to use the provided scripts
  to create the ISAR inputs.

  Genomes and GTFs can be downloaded from: 
  ftp://ftp.ensembl.org/pub/release-81/fasta/
  ftp://ftp.ensembl.org/pub/release-81/gtf/ 
       
  Further genomes for fungi, bacteria, metazoa, plants and protists can be downloaded from: ftp://ftp.ensemblgenomes.org/pub/release-28/
  
  To start the script, execute:
  
  java -cp ISAR-1.0.jar isoformtransfer.isar.tools.InputGenerator

  for cmdline the following non-optional parameters are not set:
	  -gtf -gene_ids -fasta_dir -out_dir

  parameters for cmdline:
	  -gtf <gene transfer file>
	  -gene_ids <comma separated list of gene identifiers>
	  -fasta_dir <genome fasta dir>
	  [-taxId <taxonomy id e.g. 9606 for human defaults to: <optional>>]
	  -filter_identical <filters proteins with identical coding exon annotations>
	  -out_dir <folder for output files (for each specified gene identifier a separate .fasta file is created)>

  Examples:

  Download and extract yeast genome and GTF:

  mkdir yeast
  cd yeast
  wget 'ftp://ftp.ensembl.org/pub/release-81/fasta/saccharomyces_cerevisiae/dna/*.dna.chromosome.*'
  wget ftp://ftp.ensembl.org/pub/release-81/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.81.gtf.gz
  gunzip *.gz 
  cd ..
  
  Download and extract fly genome and GTF:

  mkdir fly
  cd fly
  wget 'ftp://ftp.ensembl.org/pub/release-81/fasta/drosophila_melanogaster/dna/*.dna.chromosome.*'
  wget ftp://ftp.ensembl.org/pub/release-81/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.81.gtf.gz
  gunzip *.gz 
  cd ..

  Create the ISAR input files for two fly genes and one yeast gene:

  java -cp ISAR-1.0.jar isoformtransfer.isar.tools.InputGenerator -gtf fly/Drosophila_melanogaster.BDGP6.81.gtf -fasta_dir fly -gene_ids FBgn0031090,FBgn0031882 -out_dir . -taxId 7227
  java -cp ISAR-1.0.jar isoformtransfer.isar.tools.InputGenerator -gtf yeast/Saccharomyces_cerevisiae.R64-1-1.81.gtf -fasta_dir yeast/ -gene_ids YFL038C -out_dir . -taxId 4932

  For each given gene ID, the script creates a FASTA file that can be used as an input file for ISAR.
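
  For batch processing, the comma-separated list for -gene_ids can be collected from the GTF file itself. The
  following Python sketch relies only on the standard GTF layout (nine tab-separated columns, with a gene_id "..."
  entry in the attribute column). By default it keeps genes that have at least one CDS row, on the assumption that
  only genes with coding sequences are of interest here. The function name gene_ids_from_gtf is hypothetical.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gene_ids_from_gtf(lines, feature="CDS"):
    """Return the distinct gene_id values of rows with the given feature, in file order."""
    seen = {}                               # dict keeps insertion order
    for line in lines:
        if line.startswith("#"):            # header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue
        match = GENE_ID.search(cols[8])     # column 9 holds the attributes
        if match:
            seen.setdefault(match.group(1), None)
    return list(seen)


# Toy GTF rows (coordinates are made up for illustration).
sample = [
    '#!genome-build toy\n',
    'VI\ttoy\tgene\t100\t900\t.\t-\t.\tgene_id "YFL038C"; gene_name "YPT1";\n',
    'VI\ttoy\tCDS\t100\t720\t.\t-\t0\tgene_id "YFL038C"; transcript_id "YFL038C";\n',
    'VI\ttoy\texon\t1000\t1100\t.\t+\t.\tgene_id "tRNA1"; transcript_id "tRNA1";\n',
]
print(",".join(gene_ids_from_gtf(sample)))   # YFL038C
```

  For a real file, the function can be applied to an open file handle, e.g. gene_ids_from_gtf(open(path)), and the
  joined result passed to -gene_ids.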

3.) PROCESS ISARs
  
  FASTA and graphical representations of an ISAR can be produced from the computed xml.gz files with the Formater tool (class isoformtransfer.isar.Formater, see the examples below).

  ISAR requires the following non-optional parameters:
     -out_file -isar -out_type

  INPUT/OUTPUT:
     -isar                  FILE     input ISAR xml.gz file
     -out_file              FILE     output file
     -out_type              STR      output type
                                      Options:
                                         report_unmatched, png_simple, png_event, png, isar_mul_ali, fasta

  PNG OPTIONS:
     [-png_scale_factor         =1   Scale for output visualization]
     [-intron_size              =30  size of introns in png]
     -show_isoform_labels            

  GENERAL:
     [-org_xml                       organism configuration file]
     [-genes                STR      comma separated list of genes to be included]


  Examples for RAB1 ISAR:
   Report unmatched regions:
     java -cp ISAR-1.0.jar isoformtransfer.isar.Formater -isar RAB1 -out_type report_unmatched -out_file RAB1.unmatched
   Draw Hasse diagram:
     java -cp ISAR-1.0.jar isoformtransfer.isar.Formater -isar RAB1 -out_type png_simple -out_file RAB1.png
   Create ordinary FASTA output:
     java -cp ISAR-1.0.jar isoformtransfer.isar.Formater -isar RAB1 -out_type fasta -out_file RAB1.fasta
   Create exon-view (written to a separate file so that the Hasse diagram is not overwritten):
     java -cp ISAR-1.0.jar isoformtransfer.isar.Formater -isar RAB1 -out_type png -out_file RAB1_exon_view.png