Context-based mapping of RNA-seq data with
ContextMap 2.0

7. Mining RNA-seq data for infections and contaminations

7.1. Indexing large amounts of reference genomes

7.2. Generating mappings against an arbitrary set of species

7.3. Obtaining confidences, genome coverages and the Jensen-Shannon divergence

1. Introduction

ContextMap is a very accurate read mapping tool for data from RNA-seq experiments. It can be used to refine mapping provided by other tools or as a standalone program.

Reads are mapped in a context-sensitive way, which means that ContextMap considers all reads aligned close to a particular read to determine its unique mapping location. A detailed description of the algorithm is given in the original publication [1].

2. Obtaining ContextMap

ContextMap is implemented in Java and can be used with unmodified versions of the following unspliced aligners:

The ContextMap source code and pre-compiled binaries for Unix and MacOS X can be downloaded from the project website.

2.1. Building ContextMap from source

The core implementation of ContextMap is realized with Java and provided as a single jar file. For running the Java code you need an installed version of Java 1.6 or later on your machine. Since Java is platform independent, it is not required to re-build the code for any operating system.

For compiling the desired unspliced aligner, see the above mentioned project websites.

3. Getting Started

Here we give a detailed description of how to prepare your data for using ContextMap.

3.1. Preparing the reference sequences

ContextMap needs as input reference sequences against which the mapping will be performed. Each reference must be provided as one fasta file and all those files must be located in a single folder. For example, for mapping against the human genome you will need a folder containing the following files: chr1.fa, chr2.fa, ..., chr22.fa, chrX.fa, chrY.fa and chrMT.fa.

3.2. Building the index for the chosen unspliced aligner

ContextMap can use different unspliced aligners to align reads against a given reference. For this purpose, ContextMap uses the index for the chosen aligner. Building the index is described for each tool on the correspendonding project websites (see 2. Obtaining ContextMap).

The indices of all the chosable aligners store the fasta header information for every reference sequence, which is used in the mapping output to identify the reference to which a particular read has been aligned. ContextMap needs the underlying reference sequences in the input (see 3.1. Preparing the reference sequences) and connects each reference name from the aligner output to a reference file. Therefore, it is important that the file name of each reference is identical to its associated fasta header. For instance, if the reference file is named chr1.fa then the fasta header of this file should be ">chr1".

3.3. Preparing the read data

ContextMap reads the raw read data either from a single fasta or fastq file. If your data is separated into two or more files, you have to concatenate them. It is also possible to apply ContextMap on paired-end data (--pairedend option). Here, it is important that the ids of mates from the same fragment have the following format: "base_name/1" and "base_name/2", whereas the "base_name" has to be exactly the same id for both mates. This already holds for Illumina sequence identifiers of version 1.4 or lower (for example: @HWUSI-EAS100R:6:73:941:1973#0/1). Illumina sequence identifiers of version 1.8 or higher (for example: @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG) are also supported by ContextMap and do not need further modifications.
Again, the reads must be provided in a single file, independent of any ordering.

4. Running ContextMap

The core implementation of ContextMap is realized in Java. Thus, you need an installed version of Java 1.6 or later on your machine for running it. All the available unspliced aligners which ContextMap can make use of can be downloaded as precompiled binaries or as source code on the corresponding project websites.

For ContextMap itself, we also provide the complete source code for downloading. ContextMap is a command line only tool and can be used as follows:

java -jar ContextMap_v2.1.0.jar mapper <main arguments> [options]*

4.1. Main arguments

All of the following main arguments must be set:

-reads	A comma separated list of file paths to reads in fasta/fastq/zip/gz format. A single path for single-end mapping and two paths (#1 mates and #2 mates) for paired-end mapping
-aligner_name	The following alignment tools are supported: "bwa", "bowtie1" or "bowtie2".
-aligner_bin	The absolute path to the executable of the chosen aligner tool.
-indexer_bin	The absolute path to the executable of the aligner's indexing tool (not needed for BWA).
-indices	A comma separated list of paths to basenames of indices, which can be used by the chosen aligner.
-genome	The path to a folder with reference sequences in fasta format (for each chromosome a separate file). It is important that the chromosome names contained in the index of the chosen unspliced alignment program (see 2. Obtaining ContextMap) are equal to the filenames of the files contained in the genome folder (for instance if the index contains a chromosome called "chr1", the genome folder must contain a file called "chr1.fa").
-o	The path to the output directory.

4.2. Options

--mining	Allows to mine for infections or contaminations. In default mode ContextMap outputs for an ambiguously aligned read in addition to its best, its second best mapping location with respect to assigned context scores. By setting this option, ContextMap will print the next best hit located on any other chromosome or genome, if available. Otherwise the second best hit located on the same chromosome/genome will be printed. This behaviour is needed for calculating confidence values and mismatch distribution of identified species. Information about the exact mapping location of the second best hit is provided via four additional SAM fields in the output (see 4.3. Output).
-skipsplit	A comma separated list of booleans, each element refers to a given aligner index (same ordering). 'true' for no split detection, 'false' otherwise. Set the value to 'true' for indices based on concatenated genome sequences to prevent ContextMap from identifying spurious junction reads across different genomes (req. in mining mode).
-speciesindex	The path to a directory containing index files created with the 'indexer' tool. Needed by ContextMap to distinguish between different genomes of concatenated sequences (req. in mining mode).
-aligner_tmp	The path to a folder for temporary alignment files. If not set, ContextMap uses a subfolder of the output directory for these files. For large data sets the alignment step may generate very large alignment files (up to 100GB or more).
-seed	The seed length for the alignment (default: Bwt1: 30, BWA/Bwt2: 20).
-seedmismatches	Allowed mismatches in the seed region (default: Bwt1: 1, BWA/Bwt2: 0).
-mismatches	Allowed mismatches in the whole read (default: 4).
-mmdiff	The maximum allowed mismatch difference between the best and second best alignment of the same read (default: 0).
-maxhits	The maximum number of candidate alignments per read. Reads with more hits are skipped (bwa/bwt1) or the already found hits are reported (bwt2) (default for bwa/bwt1:10, bwt2: 3).
-minsize	The minimum number of reads, which a genomic region has to contain for being regarded as a local context (default: 10).
-annotation	The path to an annotation file in our own format (provided with the ContextMap package).
-gtf	The path to an annotation file in GTF format. This option is mutually exclusive with the -annotation option.
-t	The number of threads used for the run (default: 1).
--noclipping	Disables the calculation of clipped alignments (default: not set).
--noncanonicaljunctions	Enables the prediction of non-canonical splice sites (default: not set).
--strandspecific	Enables strand specific mapping (default: not set).
--polyA	Enables the search for polyA-tails. This option is mutually exclusive with the --noclipping option. (default: off)
--sequenceDB	Writes a readId -> sequence mapping to disk. Recommended for very large data sets. (default: off)
--printmultimappings	Forces ContextMap to print out all possible multi-mappings to an additional file (default: not set).
--verbose	The verbose mode, which generates a more detailed output in every step (default: not set).
--keeptmp	ContextMap does not delete most of the temporary files (default: not set).

4.3. Output

The final mapping will be printed in the SAM format, which contains the following fields:

The read name
The sum of the bitwise flag. Currently, ContextMap uses three relevant flag:
1. The read has been aligned to the reverse reference strand (Bit 5, 2^5=16).
2. The read is the first segment in the template (Bit 7, 2^7=128).
3. The read is the last segment in the template (Bit 8, 2^8=256).
The reference sequence name of the alignment (e.g.: chr1 for an alignment to chromosome 1).
The position of the alignment on the reference sequence (1-based).
The mapping quality, which is always set to 255 (unavailable).
The CIGAR string of the alignment. We only use the "M" (alignment match) and "N" (skipped region from the reference) symbols to represent the alignment.

Fields 7-11 are not set, which is represented either by a "0" or a "*". For reads with several possible mapping positions, the SAM file contains four additional columns:

The reference sequence name of the second best alignment.
The position of the second best alignment on the reference sequence.
The score of the best alignment.
The score of the second best alignment.

5. Performance Tuning

The largest performance gain can be obtained when ContextMap is run as a multi-threaded application (-t option). Considering threading efficiency we obtained best results using 6-12 threads, depending on the used machine and the available cores.

We recommend to use the "-Xms" and "-Xmx" option to set the minimum and maximum heap size of the java virtual machine, respectively. For mappings to genomes like human you have to reserve at least 4 GB of RAM and add for each thread another GB. For instance, if you are working with 8 threads, we suggest to set: -Xms4000M -Xmx12000m. For large data sets (> 80 Mio reads) the maximum heap size may be set to 20-30GB.

When running ContextMap in multi-threaded mode together with a large heap we experienced best results when using the concurrent garbage collector together with a relatively small heap size for the young generation:

-XX:+UseConcMarkSweepGC -XX:NewSize=300M -XX:MaxNewSize=300M

A detailed description of how to set the parameters of the Java Virtual Machine can be found here and in the script of the example call.

If ContextMap still performs too slow, needs too much memory or hard disk space, we suggest to do the following:

Increase the seed size (-seed).
Reduce the number of allowed mismatches in the seed (-seedmismatches).
Reduce the number of allowed mismatches for the whole read (-mismatches).
Reduce the maximum allowed mismatch difference between the best and second best alignment (-mmdiff).
Set the maximum number of allowed hits per read to a lower value (e.g.: 1000) (-maxhits).

Note, that all these changes may decrease the sensitivity of ContextMap.

6. Example call

For demonstrating the usage of ContextMap, we simulated human reads with the Flux Simulator [3] and extracted read sequences of a particular Ensembl gene on chromosome 1 (ENSG00000198746). As a reference sequence, we use a part of chromosome 1, which covers the mentioned gene. This sequence is used to build an index for the desired unspliced aligner. All this data is contained in the ContextMap archive, including a shell script to show how ContextMap can be called. If you execute this shell script, you have to enter name of the desired unspliced aligner and the path to the binary and the indexer (if not BWA).
To start ContextMap, it is sufficient to execute the script. In order to apply ContextMap on your own data, you simply can modify the given script.

7. Mining RNA-seq data for infections an contaminations

In the following it is described how ContextMap can be used to mine for infections and contaminations in RNA-Seq data. The methods implemented for this task are introduced in our recently published paper [4].

7.1. Indexing large amounts of reference genomes.

When searching for unknown infections or contaminations in RNA-seq data, thousands of different microbial and viral genomes have to be considered as potential candidates. However, the original Bowtie implementation was limited to index only a relatively small amount of different fasta sequences (<= 10000) and also ContextMap works best when the reference sequence directory does not contain thousands of different fasta files. Therefore, the first step is to concatenate microbial and viral genomes. This can be easily done with the 'indexer' tool provided with the ContextMap package.
Note, the split detection for reads aligned to concatenated sequences is not supported and has to be disabled by the user with the -skipsplit option (see the following section).

The indexer tool concatenates the entries of a given multi-fasta file and inserts a sequence of N's between two entries to omit read alignments overlapping subsequent entries. The output is a set of fasta files, each containing at most a pre-defined maximum number of bases (default: 250.000.000). Furthermore, '*.idx' files will be created, which allow to reconstruct the original sequences of the input. Every '*.idx' file is associated to one of the created fasta files and can later be used by other tools contained in the ContextMap package. The 'indexer' can be used as follows:
java -jar ContextMap_2.2.1.jar indexer <main arguments> [options]*
Main arguments:

-fasta	The path to a multi fasta file, which should be indexed.
-prefix	The basename of the index (e.g. 'microbes').
-o	The path to the output directory.

Options:

-gapsize	The number of N's inserted between two subsequent entries of the input (default: 100).
-indexsize	The maximum number of bases a index file contains (default: 250000000).

7.2. Generating mappings against an arbitrary set of species

Once all candidate sequences for infections and contaminations have been indexed with the 'indexer' tool, one has to build an index with the indexer tool of the desired unspliced aligner for the set of generated fasta files. Subsequently, it is necessary that all fasta files (including the chromosome files of a reference species like human), against which a mapping will be determined, are located in the same directory.
Now you can start ContextMap like it is described in the Running ContextMap section. The directory containing the fasta sequences can be provided with the -genome option and an arbitrary set of indices can be provided with the -indices option via a comma separated list of paths to the basenames of the indices. ContextMap has three options, which have to be set when mining for infections or contaminations:

--mining	Allows to mine for infections or contaminations. In default mode ContextMap outputs for an ambiguously aligned read in addition to its best, its second best mapping location with respect to assigned context scores. By setting this option, ContextMap will print the next best hit located on any other chromosome or genome, if available. Otherwise the second best hit located on the same chromosome/genome will be printed. Information about the exact mapping location of the second best hit is provided via four additional SAM fields in the output (see 4.3. Output).
-skipsplit	A comma separated list of booleans, each element refers to a given index of the unspliced aligner (same ordering). 'true' for no split detection, 'false' otherwise. Set the value to 'true' for indices based on concatenated genome sequences to prevent ContextMap from identifying spurious junction reads across different genomes.
-speciesindex	The path to a directory containing index files created with the 'indexer' tool. Needed by ContextMap to distinguish between different genomes of concatenated sequences.

7.3. Obtaining confidences, genome coverages and the Jensen-Shannon divergence

For determining the set of species contained in a given sam file together with statistics (introduced in [4]) necessary for an accurate classification, the ContextMap packages contains the 'inspector' tool, which can be used as follows:
java -jar ContextMap_2.2.1.jar inspector <main arguments> [options]*
Main arguments:

-sam	The path to a mapping file in sam format.
-idx	The path to a directory containing index files, which were generated with the 'indexer' tool.

Options:

-reference	A comma seprerated list of chromosome names or a single species identifier, which will be used as a reference for the Jensen-Shannon divergence calculation. if not set, the most abundant species in the sample will be used.
-refseqcatalog	The path to a RefSeq *.catalog file. Trivial species names will be extracted in case they are not available in the index files.
--mergecontigs	Mappings to different contigs of the same species will be merged.
--mdflag	Uses the MD field of the sam file to evaluate mismatch counts. Per default, the NM field is used.

The output of the inspector tool is a table which contains read counts, genome coverages, confidence values as well as the square root of the Jensen-Shannon divergence for every identified species.

7.4. Example

The ContextMap package contains the directory 'mining_example', which itself contains data for demonstrating the mining process from the very beginning. The underlying read data is a subset of an in-vitro simulated microbial community [5] and the reference sequences are genomes of species being part of this community. By executing the 'run_example.sh' script, the following steps will be performed:

Indexing the reference genomes
The reference sequences are located in two different files (genomes_0.fa and genomes_1.fa), which could represent in a real life scenario all microbial and all viral genomes, respectively. Indexing the sequences is done via the indexer tool with the following commands:
java -jar ../bin/ContextMap_v2.3.0.jar indexer -fasta genomes_0.fa -prefix microbesA -o indexed_references
java -jar ../bin/ContextMap_v2.3.0.jar indexer -fasta genomes_1.fa -prefix microbesB -o indexed_references
The output is written to the 'indexed_references' directory (-o option) and the base names of the newly generated fasta and index files are defined via the -prefix option.
Building indices for the unspliced aligner
The next step is to build indices for the concatenated reference genomes. This is done with the indexer program of the desired unspliced aligner via the following command:
$indexer_bin -f indexed_references/fasta/microbesA_0.fa aligner_index/demo_index_A
$indexer_bin -f indexed_references/fasta/microbesB_0.fa aligner_index/demo_index_B
We store the indices in the directory 'aligner_index' and name the indices 'demo_index_A' and 'demo_index_B', respectively.
Running ContextMap
Now, we are ready to execute ContextMap via the following command:
java -jar ../bin/ContextMap_v2.3.0.jar mapper --mining -aligner_name $aligner_name -aligner_bin $aligner_bin -indexer_bin $indexer_bin -reads reads.fa -o mapping -indices aligner_index/demo_index_A,aligner_index/demo_index_B -genome indexed_references/fasta -skipsplit true,true -speciesindex indexed_references/indices
The --mining enables a ContextMap-Mining run. The -indices option changed to previous ContextMap versions by accepting now not only one but a list of indices. Here the two indices 'demo_index_A' and 'demo_index_B' are provided, respectively. Furthermore, a mining run requires to use the -skipsplit option, which expects a list of booleans with which the user is able to tell ContextMap for which of the provided indices the split read detection can be disabled. Each element of this list refers to an element of the list provided with the -indices (the first boolean refers to the first index and so on). Since we are working in this example with two indices based on concatenated genomes, we skip the split detection for both of them. Finally, the -speciesindex option points to the directory where all *.idx files (created with the indexer tool) are located. All other options are used like it is described in the 'Running ContextMap' sections.
Obtaining confidences, genome coverages and the Jensen-Shannon divergence
The set of species contained in the input together with statistics (introduced and described in detail in [4]) necessary for an accurate classification can be determined with the 'inspector' tool of the ContextMap package:
java -jar ../bin/ContextMap_v2.3.0.jar inspector -sam mapping/mapping.sam -idx indexed_references/indices
With the -sam option the mapping file created in the ContextMap run is provided and the -idx option expects the path to the directory, where all *.idx files are located. The final output is a table, which contains for every identified species read counts, the genome coverage, confidence values and the square root of the Jensen-Shannon divergence.

8. References

[1] Thomas Bonfert, Gergely Csaba, Ralf Zimmer, Caroline C. Friedel.
A context-based approach to identify the most likely mapping for RNA-Seq experiments.
BMC Bioinformatics, vol13(Suppl 6), pp. S9, 2012

[2] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg.
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.
Genome Biol, 10(3):R25, 2009

[3] The Flux Project (2011). Flux simulator version 1.0-RC4
Flux simulator website

[4] Thomas Bonfert, Gergely Csaba, Ralf Zimmer, Caroline C. Friedel.
Mining RNA-seq Data for Infections and Contaminations.
PloS one, vol 8, pp. e73071, 2013

[5] Morgan JL, Darling AE, Eisen JA.
Metagenomic sequencing of an in vitro-simulated microbial community.
PLoS One, 5: e10209, 2010

[6] B. Langmead, S. Salzberg.
Fast gapped-read alignment with Bowtie 2.
Nature Methods, 9:357-359, 2012

[7] H. Li, R. Durbin.
Fast and accurate short read alignment with Burrows-Wheeler transform.
Bioinformatics, 25, 1754-1760, 2009

Context-based mapping of RNA-seq data with ContextMap 2.0

Table of contents

7.1. Indexing large amounts of reference genomes.

Context-based mapping of RNA-seq data with
ContextMap 2.0