Sequencing of RNA (RNA-seq) using next-generation sequencing has become the standard approach for profiling the transcriptomic state of a cell. This requires mapping the sequencing reads to determine their transcriptomic origin.
Recently, we developed a context-based mapping approach, ContextMap, which determines the most likely origin of a read by evaluating the context of the read in the form of alignments of other reads to the same genomic region. In the original implementation, the focus was on improving initial mappings provided by other mapping tools.
Here, we present ContextMap 2.0, an extension of the original ContextMap method, which can also be used as a standalone tool without relying on initial mappings by other tools. We show that it yields highly accurate read mappings and is very robust against sequencing errors. The design of ContextMap 2.0 allows for massively parallelized data processing, resulting in reasonable running times despite the higher complexity of the context-based approach.
The standalone ContextMap 2.0 algorithm consists of five major steps (see figure below).
In the first step, ContextMap 2.0 aligns reads to a given reference genome using a modified Bowtie version that performs alignments in forward and backward direction to also detect reads from exon-exon junctions (split reads).
These initial alignments are then used to calculate contexts, defined as reads originating from the same stretch of genome. For this purpose, ContextMap clusters read alignments based on their genomic starting position, allowing multiple alignments of reads. Extension of alignments and identification of the most likely mapping for each read is then performed independently for each context (steps C-D) with integration performed only in the last step (E). This strategy allows ContextMap to make heavy use of multi-core machines by processing many contexts in parallel.
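The context construction described above can be illustrated with a minimal sketch. It is not the ContextMap implementation: the `Alignment` record, the `max_gap` threshold, and the placeholder `process_context` function are assumptions made for illustration only. The sketch groups alignments whose start positions lie close together on the same chromosome and then processes the resulting contexts independently with a worker pool.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical alignment record; a read may have several of these."""
    read_id: str
    chrom: str
    start: int  # genomic start position
    end: int


def build_contexts(alignments, max_gap=10_000):
    """Cluster alignments by start position.

    A new context begins whenever the distance between consecutive start
    positions exceeds max_gap (an assumed threshold, not a ContextMap default).
    """
    by_chrom = defaultdict(list)
    for aln in alignments:
        by_chrom[aln.chrom].append(aln)

    contexts = []
    for chrom in sorted(by_chrom):
        current = []
        for aln in sorted(by_chrom[chrom], key=lambda a: a.start):
            if current and aln.start - current[-1].start > max_gap:
                contexts.append(current)
                current = []
            current.append(aln)
        if current:
            contexts.append(current)
    return contexts


def process_context(context):
    """Placeholder for per-context work; here it only counts distinct reads."""
    return len({aln.read_id for aln in context})


def process_all(contexts, workers=4):
    """Process contexts independently; results keep the order of the input."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_context, contexts))
```

Because each context is handled without reference to any other, the per-context work can be distributed across cores, and only the final integration step needs to see all contexts at once.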
Furthermore, a large number of additional candidate alignments can be created for each read with little impact on runtime. Here, ContextMap creates all possible alignments for each read that satisfy the maximum mismatch criterion, including, for example, additional split read alignments derived from full read alignments that overlap a previously identified splice site.
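As a rough illustration of how a split read candidate might be derived from a full read alignment that overlaps a known splice site, consider the sketch below. The function names, the 0-based coordinate convention, and the returned tuple are assumptions for this example and do not reflect ContextMap's internal data structures.

```python
def count_mismatches(read, ref_segment):
    """Number of positions at which read and reference differ."""
    return sum(1 for a, b in zip(read, ref_segment) if a != b)


def split_candidate(read, genome, start, donor, acceptor, max_mismatches):
    """Try to re-align a read across a known splice site.

    start    : 0-based start of the full (unspliced) alignment
    donor    : 0-based index of the first intron base
    acceptor : 0-based index of the first base of the downstream exon
    Returns (start, left_len, acceptor, right_len, mismatches), or None if
    the read does not span the donor site or exceeds max_mismatches.
    """
    left_len = donor - start
    if not 0 < left_len < len(read):
        return None  # the read does not cross the donor site
    right_len = len(read) - left_len
    if acceptor + right_len > len(genome):
        return None  # the second part would run past the reference
    ref = genome[start:donor] + genome[acceptor:acceptor + right_len]
    mismatches = count_mismatches(read, ref)
    if mismatches > max_mismatches:
        return None
    return (start, left_len, acceptor, right_len, mismatches)
```

A read that spans a junction typically accumulates mismatches where the unspliced alignment runs into the intron, while the split candidate removes them. Both candidates can be kept as long as each satisfies the mismatch limit, leaving the choice to the later resolution steps.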
(D) + (E)
Resolution of the many ambiguous alignments for each read is performed in steps (D) and (E), first within contexts (D) and subsequently between contexts (E). Both of these resolution steps are based on a scoring scheme that takes into account the number of reads aligned within and around a particular read alignment. If a transcriptome annotation is provided, ContextMap prefers candidate split read alignments corresponding to known junctions. In both (D) and (E) the alignment with the highest support score is chosen for each read instead of simply the alignment with the minimum number of mismatches, resulting in a unique mapping for each read first within each context (step D) and finally across all contexts (E).
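The idea of preferring the best-supported alignment over the one with the fewest mismatches can be sketched as follows. This is a deliberately simplified stand-in for the actual scoring scheme: the `Candidate` record, the fixed `window` size, and the plain count of neighbouring alignments are all assumptions for illustration, and annotation-based preferences are omitted.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical candidate alignment of a read."""
    read_id: str
    start: int
    end: int
    mismatches: int


def support_score(candidate, all_candidates, window=50):
    """Count alignments of other reads within or around the candidate.

    'Around' is modelled as the candidate interval extended by window bases
    on each side (an assumed, simplified notion of local support).
    """
    lo, hi = candidate.start - window, candidate.end + window
    return sum(
        1
        for other in all_candidates
        if other.read_id != candidate.read_id
        and other.start <= hi
        and other.end >= lo
    )


def resolve(all_candidates, window=50):
    """Pick one alignment per read: highest support first, then fewest mismatches."""
    best = {}
    for cand in all_candidates:
        key = (support_score(cand, all_candidates, window), -cand.mismatches)
        if cand.read_id not in best or key > best[cand.read_id][0]:
            best[cand.read_id] = (key, cand)
    return {read_id: cand for read_id, (key, cand) in best.items()}
```

In this sketch, a candidate with one mismatch that lies among alignments of other reads wins over a perfect match in an otherwise empty region. The same selection could be applied first within each context and then once more across contexts.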
As recently shown, ContextMap considerably improves the mapping accuracy of initial mappings by other tools (see the original publication).
Here, we demonstrate on in-silico human and mouse RNA-seq data sets with different error rates that ContextMap 2.0 in standalone mode also outperforms other state-of-the-art methods with regard to alignment accuracy (see Table below). This includes both methods using transcriptome annotations and genome-only approaches. Alignment accuracy of ContextMap 2.0 was generally higher than for the compared methods on all data sets, with the largest improvement observed for 2% error rates. In contrast to the other approaches, ContextMap was only slightly affected by the increased sequencing error rates.
Current release version: 2.1.1
Support of paired-end data
Support of strand-specific data
Runtime and memory usage improvements at the global resolution step
Added 'XS' tag to the SAM output. ContextMap can now be used together with Cufflinks
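For readers who want to check the 'XS' tag in their own output, the sketch below reads it from a single SAM record. It relies only on the general SAM layout (eleven mandatory tab-separated columns followed by optional `TAG:TYPE:VALUE` fields) and on the `XS:A:+` / `XS:A:-` strand attribute that Cufflinks expects on spliced alignments. The function name and the example record in the usage note are made up for illustration.

```python
def xs_strand(sam_line):
    """Return '+' or '-' from an XS:A tag, or None if the tag is absent.

    Only character-typed XS tags are accepted, because some aligners use
    XS with an integer type for an unrelated purpose.
    """
    fields = sam_line.rstrip("\n").split("\t")
    for field in fields[11:]:  # optional fields start at column 12
        tag, value_type, value = field.split(":", 2)
        if tag == "XS" and value_type == "A":
            return value
    return None
```

For an illustrative record ending in `NM:i:0` and `XS:A:+`, the function returns `'+'`; a record without the tag yields `None`.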