Master Practical Course I: EmpiReS@ISAR

EmpiReS@ISAR

Isoform-aware analysis of high-throughput expression data: analysis of cancer, atherosclerosis, or COVID-19 data

This page is still under construction!

Course instructors

General Information

Credits and workload: 12 ECTS / 10 SWS (10P/Block) = 360 working hours
Time (during the semester): Tue + Thu, 13-18h (~300h)
Time (block phase): 1-2 weeks (~60h)
Room: Hiwi rooms + 406, Amalienstr. 17
Supervisors: Prof. Dr. Ralf Zimmer, Armin Hadziahmetovic

Topic/Description/Contents

Sequencing-based expression data are available on a large scale for many species, tissues, cells, and conditions. Differential gene expression analysis (DGEA) is a major analysis tool [1] for understanding the functions of genes and pathways. DGEA therefore has to be performed accurately and efficiently on a large scale, and the results of many such analyses need to be combined, compared, and visualized.
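To make the idea of a differential expression comparison concrete, here is a minimal sketch that computes a log2 fold change per gene between two conditions. It is not the EmpiReS method and performs no statistical test; the gene names, the counts, and the pseudocount of 1 are hypothetical values chosen only for illustration.

```python
import math
from statistics import mean


def log2_fold_change(control, treated, pseudocount=1.0):
    """Log2 ratio of mean expression (treated vs. control).

    The pseudocount avoids taking the logarithm of zero for unexpressed genes.
    """
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


# Hypothetical normalized counts: gene -> (control replicates, treated replicates)
counts = {
    "GENE_A": ([6, 7, 8], [30, 31, 32]),
    "GENE_B": ([14, 15, 16], [15, 15, 15]),
    "GENE_C": ([62, 63, 64], [6, 7, 8]),
}

fold_changes = {gene: log2_fold_change(c, t) for gene, (c, t) in counts.items()}

# Rank genes by absolute effect size, largest change first
ranked = sorted(fold_changes, key=lambda g: abs(fold_changes[g]), reverse=True)

for gene in ranked:
    print(f"{gene}\t{fold_changes[gene]:+.2f}")
```

A real analysis would additionally model replicate variability and correct for multiple testing before calling a gene differentially expressed.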

Major resources for DGEA are expression data from public resources such as recount3 [2] and raw sequencing data from compendia such as SRA, GTEx, ENCODE3, or TCGA [3-8]. Gene structures are needed for an isoform-aware mapping of sequencing reads to gene models. The Isoform Structure Alignment Representation (ISAR) provides aligned transcript isoforms of genes across a number of species. EmpiReS is a general method to identify both differentially expressed (DE) genes and differential alternative splicing (DAS) based on the ISAR gene models. The evolutionary isoform Browser (eiBrow) will be used to visualize identified DE and DAS events, including coverage, fold change, and junction read data, as well as binding sites and splice donor and acceptor sites. Differences and consensus between competing methods will be analysed systematically.
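As an illustration of how junction read data can indicate differential alternative splicing, the sketch below computes a simplified "percent spliced in" (PSI) value for a single exon-skipping event and its change between two conditions. This is a commonly used descriptive metric, not the statistical model used by EmpiReS; the function names and read counts are hypothetical, and the normalization for the number of supporting junctions is deliberately omitted.

```python
def psi(inclusion_reads, exclusion_reads):
    """Fraction of junction reads that support inclusion of the exon.

    Returns None when there is no junction coverage, since PSI is undefined.
    """
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None
    return inclusion_reads / total


def delta_psi(cond_a, cond_b):
    """PSI of condition b minus PSI of condition a; None if either is undefined."""
    psi_a = psi(*cond_a)
    psi_b = psi(*cond_b)
    if psi_a is None or psi_b is None:
        return None
    return psi_b - psi_a


# Hypothetical junction read counts for one event: (inclusion, exclusion)
healthy = (90, 10)
disease = (30, 70)

change = delta_psi(healthy, disease)
print(f"PSI healthy: {psi(*healthy):.2f}")
print(f"PSI disease: {psi(*disease):.2f}")
print(f"delta PSI:   {change:+.2f}")
```

A negative value means the exon is included less often in the second condition. Deciding whether such a change is significant requires replicate data and a proper statistical test.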

The practical course will build a pipeline for the flexible analysis of disease expression data to identify, validate, and visualize differential expression and differential alternative splicing for various diseases, e.g. cancer, atherosclerosis, and COVID-19 [9]. The goal of the pipeline is to allow a comparison of project-specific sequencing data with the wealth of expression data in public repositories.

Aims and Learning Goals:
The pipeline will build on available state-of-the-art tools for efficient analysis and convenient visualization of results, using modern Python and R programming environments and packages. Visualization will be done with user-friendly Shiny or Dash apps. Robustness and reproducibility of results are important requirements for all implementations. The work will be summarized in a presentation and a scientific paper (to be submitted to a journal for peer review).

Prerequisites:
Bachelor in Bioinformatics, in particular successful completion of the GoBi module. Good programming skills (Java and/or Python). Interest in data visualization and complex human diseases. Knowledge of image processing and analysis is advantageous (it can also be acquired during the practical course).

Structure/Schedule

Feb/Mar 2022: Kickoff meeting and assignment of projects and teams
Apr-Jul 2022: ~300h project and paper planning, project work, interim presentations, and discussions
Jul-Sep 2022: ~60h block phase, project work, paper writing, final presentation, and paper submission

Prerequisites

Undergraduate degree in Bioinformatics (Bachelor or Diplom)
Bioinformatics programming course
Practical course in genome-oriented bioinformatics
Good programming skills (bachelor level)

Internal web page

At the beginning of the practical course, all required material will be provided on the internal web page (NEAP2022 EmpiReS@ISAR internal web page).

Literature

[1] Susan Holmes, Wolfgang Huber, Modern Statistics for Modern Biology, Cambridge University Press, 2019.
[2] Wilks C, Zheng SC, Chen FY, Charles R, Solomon B, Ling JP, Imada EL, Zhang D, Joseph L, Leek JT, Jaffe AE, Nellore A, Collado-Torres L, Hansen KD, Langmead B. recount3: summaries and queries for large-scale RNA-seq expression and splicing. Genome Biol. 2021 Nov 29;22(1):323. doi: 10.1186/s13059-021-02533-6. PMID: 34844637; PMCID: PMC8628444.
[3] The SRA Toolkit Development Team, https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
[4] Burgess DJ. Reaching completion for GTEx. Nat Rev Genet. 2020 Dec;21(12):717. doi: 10.1038/s41576-020-00296-7. PMID: 33060849.
[5] GTEx Consortium. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015 May 8;348(6235):648-60. doi: 10.1126/science.1262110. Epub 2015 May 7. PMID: 25954001; PMCID: PMC4547484.
[6] ENCODE: https://www.genome.gov/Funded-Programs-Projects/ENCODE-Project-ENCyclopedia-Of-DNA-Elements, ENCODE integrative analysis (PMID: 22955616), ENCODE portal (PMID: 29126249)
[7] The TCGA Research Network: https://www.cancer.gov/tcga.
[8] Wu M, Shang X, Sun Y, Wu J, Liu G. Integrated analysis of lymphocyte infiltration-associated lncRNA for ovarian cancer via TCGA, GTEx and GEO datasets. PeerJ. 2020 May 7;8:e8961. doi: 10.7717/peerj.8961. PMID: 32419983; PMCID: PMC7211406.
[9] Wyler, E., Mösbauer, K., Franke, V., Diag, A., Gottula, L. T., Arsiè, R., Klironomos, F., Koppstein, D., Hönzke, K., Ayoub, S., et al. (2021) Transcriptomic profiling of SARS-CoV-2 infected human cell lines identifies HSP90 as target for COVID-19 therapy. iScience, 24(3), 102151.

Jeff Sutherland, Scrum: The Art of Doing Twice the Work in Half the Time, Random House, 2015
Scott Morgan & Barrett Whitener, Speaking about Science - A Manual for Creating Clear Presentations, Cambridge University Press, 2006