Efficient algorithms and space-efficient data structures play an
important role in Bioinformatics. Without new, clever, and ingenious
algorithmic ideas the recent challenges from modern Life Sciences could
not been handled. Thus, the recent progress in Bioinformatics and System
Biology relies heavily on these algorithmic concepts. Clearly, the
problems covered by Algorithmic Bioinformatics and which we have
addressed stem from various areas of Life Sciences.
Succinct Data Structures in Bioinformatics
Range Minimum Queries and Suffix Arrays
The Range-Minimum-Query-Problem (RMQ) is to preprocess an array in
linear time such that all subsequent queries asking for the position of
a minimal element between two specified indices can be obtained in
constant time. The first algorithm that never uses more than 2n+o(n)
bits has been developed, which does not rely on rank- and select-queries
or other succinct data structures. The importance of this result is
stressed by simplifying and reducing the space consumption of the
Enhanced Suffix Array, while retaining all its capabilities to simulate
Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays
, SIAM Journal on Computing, vol. 40, no. 2, pp. 465-492, 2011.
Compressed Range Minimum Queries
It is shown that for compressible input arrays the RMQ-information can
be compressed as well. In particular, the information for RMQs can be
stored within the same entropy bounds that are achieved by the currently
best schemes for storing the underlying array itself in compressed form,
while still being able to access a logarithmic number of contiguous bits
in constant time. Evaluations show that the practical space consumptions
of the non-compressed schemes scale surprisingly well with their
theoretical guarantees, and for compressible input arrays the new
compressed schemes can indeed reduce the space, with little or no
slowdown in query time.
, Horst Martin Stühler,
Practical Entropy-Bounded Schemes for O(1)-Range Minimum Queries
, Proceedings of the 2008 Data Compression Conference (DCC’08) (J.A. Storer, M.W. Marcellin, eds.), IEEE Computer Society, Snowbird, Utah, U.S.A., March 25-27, 2008, pp. 272-281, 2008.
Range Median of Minima Queries
A natural extension of RMQ is considered, where a further constraint is
that if the minimum in a query interval is not unique, then the query
should return an approximation of the median position among all
positions that attain this minimum. A succinct preprocessing scheme
using Dn+o(n) bits in addition to the static input array (for a small
constant D) has been developed, such that subsequent range median of
minima queries can be answered in constant time. This data structure can
be built in linear time, with little extra space needed at construction
time. Several new combinatorial concepts are introduced such as
Super-Cartesian Trees and Super-Ballot Numbers, which are believed to
have other interesting applications in the future.
Range Median of Minima Queries, Super-Cartesian Trees, and Text Indexing
, Proceedings of the International Workshop on Combinatorial Algorithms (IWOCA’08) (Mirka Miller, Koichi Wada, eds.), College Publications, vol. 12, September 13-15, 2008, Nagoya, Japan, pp. 239-252, 2010.
Finding Range Minima in the Middle: Approximations and Applications
, Mathematics in Computer Science, vol. 3, no. 1, pp. 17-30, 2010.
String Mining in Bioinformatics
A new algorithmic framework has been developed for solving
frequency-related data mining queries on databases of strings in optimal
time, i.e., in time linear in the input and the output size. The
additional space is linear in the input size. This framework can be used
to mine frequent strings, emerging strings, and strings that pass other
statistical tests, e.g., the 2-test. The advantages of array-based data
structures (compared with dynamic data structures such as trees) are
good locality behavior and extensibility to secondary memory. The
algorithm is tested on real-world biological data and demonstrates that
the approach also works well in practice.
, Stefan Kramer,
Optimal String Mining Under Frequency Constraints
, Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD’06) (J. Fürnkranz, T. Scheffer, M. Spiliopoulou, eds.), Springer-Verlag, vol. 4213, Berlin, Germany, September 18-22, 2006, pp. 139-150, 2006.
Algorithms for Evolutionary Aspects
Hybroscale - Computation of hybridization networks
Hybroscale is developed specifically for the research of hybridization networks including its computation and visualization.
Celine Scornavacca, Simone Linz,
A First Step Toward Computing All Hybridization Networks For Two Rooted Binary Phylogenetic Trees
, Journal of Computational Biology, vol. 19, no. 11, pp. 1227-1242, 2012.
, Celine Scornavacca, Alberto Cenci, Daniel H Huson,
Fast computation of minimum hybridization networks
, Bioinformatics, vol. 28, no. 2, pp. 191-197, 2011.
Sorting by Prefix reversals
Sorting by Prefix Reversals, also known as Pancake Flipping, is the
problem of transforming a given permutation into the identity
permutation, where the only allowed operations are reversals of a prefix
of the permutation. The problem complexity is still unknown. The first
polynomial-time 2-approximation algorithm to solve this problem has been
developed. Empirical tests suggest that the average performance is in
fact better than 2.
, Simon W. Ginzinger,
A 2-Approximation Algorithm for Sorting by Prefix Reversals
, Proceedings of the 13th Annual European Symposium on Algorithms (ESA’05) (Gerth Stølting Brodal, Stefano Leonardi, eds.), Springer-Verlag, vol. 3669, Mallorca, Spain, October 3-6, 2005, pp. 415-425,, 2005.
Genome-Rearrangements: Sortieren mit erweiterten Transreversals
, Diploma Thesis, LFE Bioinformatik / LMU München, 2009.
Reconstructing Ultrametric Trees
The reconstruction of an ultrametric tree from a distance matrix is a
very frequent subproblem in clustering or reconstructing evolutionary
trees, both are common problems in Bioinformatics. In his famous book,
Gusfield presented a very simple, but not time-optimal recursive
algorithm for this problem (see also errata of Gusfield's book). It has
been shown that a simple modification of Gusfield's algorithm allows a
time-optimal solution, which is the first simple optimal algorithm for
Analysis of a Modification of Gusfield’s Recursive Algorithm for Reconstructing Ultrametric Trees
, Information Processing Letters, vol. 108, no. 4, pp. 222-225, 2008.
Combinatorial Protein Folding
The extended cubic lattice is a natural extension of the cubic lattice
for the HP model proposed by Dill et al that bypasses its major
drawback, its bipartiteness, by adding plane diagonals. In this model,
general folding algorithms which achieve an approximation ratio of 59/70
for all protein sequences and an approximation ratio of 37/42 for a
restricted but quite natural subset of HP-sequences have been proposed.
Approximate Protein Folding in the HP Side Chain Model on Extended Cubic Lattices
, Discrete Applied Mathematics, vol. 127, no. 1, pp. 163-177, 2003.
Experimental Protein Structure Determination
Analysis of Chemical Shift Data for NMR Structure Determination
The goal of this project is to achieve a considerable speed-up in the NMR structure solving process. Therefore computational methods are developed which allow the creation of models for a protein structure at an experimental stage in which this is not possible by hand. These models may then be refined and tested for correctness using the "standard" way. The time saving results from the need for fewer experiments in when building the initial model.
Automatic Correction of Inconsistent Chemical Shift Referencing
The construction of a consistent protein chemical shift database is an important step toward making more extensive use of this data in structural studies. Nowadays available data is frequently impurified, in particular with respect to chemical shift referencing, which is often either inaccurate or inconsistently annotated. CheckShift is at tool for preprocessing chemical shift data to detect and correct referencing errors.