ISAR: Isoform Structure Alignment Representation
Eukaryotic gene structures are complex. Many coding sequences have been observed, and more continue to be observed, through various experimental techniques. A convenient and comprehensive cross-species representation of isoforms can help in the comparative analysis of the expression, function, and evolution of (alternative) transcripts. We address this issue by introducing the Isoform Structure Alignment Representation (ISAR). ISAR comprises a data structure (iDAG) and an algorithm for a consistent Multiple Sequence Alignment (MSA) of isoforms from sets of orthologous and paralogous genes, one that is aware of gene, transcript, and exon-intron structure. An efficient algorithm constructs ISARs (iDAGs) from large sets of gene and isoform sequences by successively integrating high-confidence candidate alignments. The approach is based on partially ordered sets and novel operations that allow maximal consistent alignments to be represented in a sparse graph data structure. Candidate alignments can be obtained from diverse sources, which enables the integration and conversion of any given set of alignments into an ISAR. ISARs allow the systematic classification and detailed exploration of exon-intron structure across large sets of phylogenetic taxa, as well as the efficient prediction of new isoforms across phylogenetically distant species.
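To illustrate what successively integrating candidate alignments under a consistency constraint can look like, the following is a minimal sketch and not the ISAR algorithm itself. It assumes that each sequence position is a node, that the order of positions within a sequence defines a partial order, and that a candidate pair is accepted only if merging its two alignment columns keeps the resulting graph acyclic. The function name `integrate_alignments`, its parameters, and the greedy highest-confidence-first strategy are all hypothetical choices made for this illustration.

```python
def integrate_alignments(seq_lengths, candidates):
    """Greedily merge candidate position pairs into consistent alignment columns.

    seq_lengths: dict mapping a sequence id to its length (hypothetical input format)
    candidates:  iterable of ((seq_a, pos_a), (seq_b, pos_b), confidence)
    Returns the sorted list of columns that contain more than one position.
    """
    # Union-find over all positions; every position starts as its own column.
    parent = {(s, i): (s, i) for s, n in seq_lengths.items() for i in range(n)}
    members = {node: [node] for node in parent}

    def find(node):
        # Find the column representative, compressing the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def successors(root):
        # A column precedes every column holding the next position of a member.
        for s, i in members[root]:
            if i + 1 < seq_lengths[s]:
                yield find((s, i + 1))

    def reaches(src, dst):
        # Depth-first search in the graph of columns.
        stack, seen = [src], {src}
        while stack:
            cur = stack.pop()
            for nxt in successors(cur):
                if nxt == dst:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    # Assumption: candidates are integrated in order of decreasing confidence.
    for a, b, _conf in sorted(candidates, key=lambda c: -c[2]):
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # already in the same column
        if reaches(ra, rb) or reaches(rb, ra):
            continue  # merging would create a cycle, so the pair is inconsistent
        parent[rb] = ra
        members[ra].extend(members.pop(rb))

    return sorted(sorted(col) for col in members.values() if len(col) > 1)
```

For two sequences `x` and `y` of length 3, a high-confidence pair aligning `x[0]` with `y[1]` causes a lower-confidence crossing pair (`x[1]` with `y[0]`) to be rejected, while a later pair aligning `x[2]` with `y[2]` is still accepted. A sparse representation, as described above, would avoid the repeated reachability searches that this naive sketch performs.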