In bioinformatics, one important task is to distinguish between native and mirror protein models based on structural information. This information can be obtained from the atomic coordinates of the protein backbone. This thesis addresses the problem of distinguishing these conformations by examining the statistical distribution of the backbone dihedral angles, which is visualized in Ramachandran plots. Using an interpretable machine learning classification method, Generalized Matrix Learning Vector Quantization, we are able to distinguish between native and mirror protein models with high accuracy. Furthermore, the classifier model provides additional information on the regions of the distribution that are important for the distinction, such as α-helices and β-strands.
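The dihedral angles mentioned above can be computed directly from four consecutive backbone atom positions (φ from C(i-1), N(i), Cα(i), C(i); ψ from N(i), Cα(i), C(i), N(i+1)). The following is a minimal sketch of that calculation, not the thesis's own code; the function name `dihedral` and the toy coordinates are assumptions made for illustration, and the sign follows the convention of the formula used here. It also shows why these angles carry the native/mirror signal: reflecting the coordinates negates the angle.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _scale(a, s):
    return (a[0] * s, a[1] * s, a[2] * s)

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four 3D points (hypothetical helper)."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Unit vector along the central bond p1 -> p2.
    b1n = _scale(b1, 1.0 / math.sqrt(_dot(b1, b1)))
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, _scale(b1n, _dot(b0, b1n)))
    w = _sub(b2, _scale(b1n, _dot(b2, b1n)))
    x = _dot(v, w)
    y = _dot(_cross(b1n, v), w)
    return math.degrees(math.atan2(y, x))

def mirror_x(p):
    """Reflect a point through the yz-plane."""
    return (-p[0], p[1], p[2])

# Toy coordinates (not real protein data).
points = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
angle = dihedral(*points)
mirrored_angle = dihedral(*[mirror_x(p) for p in points])
```

In this sketch `mirrored_angle` is the negative of `angle`, so a mirror model populates the point-reflected regions of the Ramachandran plot.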
The automatic comparison of RNA/DNA sequences, or nucleotide sequences in general, is a complex task that requires careful design because of its computational complexity. While alignment-based models suffer from high computational cost in time, alignment-free models depend on appropriate data preprocessing and a consistently designed mathematical data comparison. This work deals with the latter strategy. In particular, a systematic categorization is proposed that emphasizes two key concepts which have to be combined for a successful comparison analysis: 1) the data transformation, comprising adequate mathematical sequence coding and feature extraction, and 2) the subsequent (dis-)similarity evaluation of the transformed data by means of problem-specific but mathematically consistent proximity measures. Approaches from different categories of the introduced scheme are examined with regard to their suitability for distinguishing natural RNA virus sequences from artificially generated ones with varying degrees of biological feature preservation. The challenge in this application is the limited additional biological information available, so that the decision has to be made solely on the basis of the sequences and their inherent structural characteristics. To address this, the present work focuses on interpretable, dissimilarity-based classification models from machine learning, namely variants of Learning Vector Quantizers. These methods are known to be robust and highly interpretable and therefore allow the applied data transformations to be evaluated together with the chosen proximity measure with respect to the given discrimination task. First analysis results are provided and discussed, serving as a starting point for a more in-depth analysis of this problem in the future.
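The two key concepts named above, a data transformation followed by a (dis-)similarity evaluation, can be illustrated with one common alignment-free choice: normalized k-mer frequency vectors compared by Euclidean distance. This is only a sketch under that assumption; the abstract does not state which codings or proximity measures the thesis actually uses, and the names `kmer_profile` and `euclidean` as well as the example sequences are hypothetical.

```python
from collections import Counter
from itertools import product
import math

def kmer_profile(seq, k=3, alphabet="ACGU"):
    """Step 1 (transformation): relative k-mer frequencies in a fixed order."""
    seq = seq.upper().replace("T", "U")  # treat DNA input as RNA
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only k-mers over the given alphabet are counted; others (e.g. with N) are ignored.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]

def euclidean(u, v):
    """Step 2 (proximity): Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy sequences (not real virus data).
profile_a = kmer_profile("AUGGCUAGCUAGGAUCC")
profile_b = kmer_profile("AUGGCUAGCUAGGAUCG")
distance = euclidean(profile_a, profile_b)
```

Fixed-length vectors of this kind can be passed to a prototype-based classifier such as a Learning Vector Quantizer, whose prototypes live in the same feature space as the data.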
In this work, a second version of the Python implementation of the Probabilistic Regulation of Metabolism (PROM) algorithm was created and applied to the metabolic model iSynCJ816 of the organism Synechocystis sp. PCC 6803. A cross-validation was performed to determine the minimum amount of expression data needed to produce meaningful results with the PROM algorithm. The failed attempt to reproduce the results of a method called Integrated and Deduced Regulation of Metabolism (IDREAM) is documented, and possible causes of the failure are discussed.
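As a rough illustration of the probabilistic idea behind PROM as it is usually described, the sketch below estimates the probability that a target gene is active given that its regulating transcription factor is inactive, and uses it to scale a flux upper bound. It is not the thesis's implementation: the function names, the binarized toy data, and the simple scaling rule are assumptions, and a full PROM run additionally requires a regulatory network, a constraint-based metabolic model, and a linear programming solver.

```python
def prob_target_on_given_tf_off(tf_on, target_on):
    """Estimate P(target on | TF off) from paired, binarized expression samples.

    Returns None when the TF is never off in the data (no estimate possible).
    """
    target_when_tf_off = [t for f, t in zip(tf_on, target_on) if not f]
    if not target_when_tf_off:
        return None
    return sum(target_when_tf_off) / len(target_when_tf_off)

def scaled_upper_bound(v_max, probability):
    """Scale a reaction's maximal flux by the estimated probability (assumed rule)."""
    return probability * v_max

# Toy binarized expression states across four conditions (hypothetical data).
tf_states = [False, False, True, False]
target_states = [True, False, True, True]

p = prob_target_on_given_tf_off(tf_states, target_states)
bound = scaled_upper_bound(10.0, p)
```

Because the estimate is a ratio of sample counts, it becomes unreliable when few conditions are available, which is one reason the amount of expression data matters for the quality of the results.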
Owing to the significant advances in high-throughput sequencing technologies and the resulting exponential growth of biological data, bioinformatics faces challenges in storing and analyzing large volumes of data. The extensive barley genotyping data used as the reference data set were generated by Genotyping by Sequencing (GBS) at the Leibniz-Institut für Pflanzengenetik und Kulturpflanzenforschung (IPK). To store and analyze these data effectively, various data structures were created and evaluated with respect to performance and memory requirements. Several Java tools were developed to read, process, and output these data for effective structuring and analysis. To make these Java tools usable via edge computing, work was carried out on the creation of data containers.