This thesis analyzes specific sequence regions. The starting data set consists of sequences in which residues have been identified that initiate the folding process and support the formation of secondary structure elements. These early folding residues are essential for understanding the protein folding process. The aim of this work is to find generally valid sequence patterns for all early folding residues and to determine what role these patterns play in protein folding. To this end, a program for processing the raw data was written, and the resulting sequence regions were clustered. The clusters were then visualized as sequence logos.
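The processing step described in this abstract (cutting out sequence regions around early folding residues and summarizing them for a sequence logo) can be illustrated with a minimal sketch. It is not the program written for the thesis: the function names, the flank width, and the `-` padding character are assumptions chosen for illustration, and the clustering step is omitted.

```python
from collections import Counter


def extract_windows(sequence, efr_positions, flank=2, pad="-"):
    """Return fixed-length windows centred on each flagged residue.

    efr_positions are 0-based indices into `sequence` (an assumption);
    windows that run past either end are filled with `pad`.
    """
    padded = pad * flank + sequence + pad * flank
    width = 2 * flank + 1
    return [padded[p:p + width] for p in efr_positions]


def position_frequencies(windows):
    """Per-column relative frequencies, the raw input for a sequence logo."""
    if not windows:
        return []
    freqs = []
    for col in range(len(windows[0])):
        counts = Counter(w[col] for w in windows)
        total = sum(counts.values())
        freqs.append({aa: c / total for aa, c in counts.items()})
    return freqs


# Hypothetical toy input: residues 0 and 2 are flagged as early folding.
windows = extract_windows("MKVLA", [0, 2], flank=1)
columns = position_frequencies(windows)
```

In practice the windows would first be grouped by a clustering method, and each group would be passed to `position_frequencies` separately so that every cluster yields its own logo.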
Influenza A viruses are responsible for epidemics as well as pandemics worldwide. The viral surface protein neuraminidase is responsible, among other things, for the release of virions from the cell and is therefore of interest in pharmacological research. The aim of this work is to gain knowledge about evolutionary changes in influenza A neuraminidase sequences using several methods. First, EVcouplings was used to identify evolutionary couplings within the protein sequences, but this analysis was unsuccessful, probably because of the long sequence of neuraminidase. Second, the natural vector method was used for sequence embedding, with the aim of visualizing the sequential progression of the virus protein over time. Last, interpretable machine learning methods were applied to examine whether the data can be classified by year and whether the extracted information agrees with the results of the EVcouplings analysis. In addition to the class label year, other labels such as groups or subtypes were used in classification, with varying results. The machine learning models performed adequately for balanced classes, but not for imbalanced data. Groups and subtypes could be classified with high accuracy, whereas years, continents, and hosts could not. To identify the minimal number of features necessary for linear separation of neuraminidase group 1 subtypes, a logistic regression was performed as a final step, which identified 15 combinations of nine amino acid frequencies. Since neither the sequence embedding nor the machine learning methods showed neuraminidase evolution over time, further research is necessary, for example with a focus on a single subtype with balanced data.
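The natural vector embedding mentioned in this abstract can be sketched as follows. This is one common formulation of the method, not necessarily the variant used in the thesis: for each letter of the alphabet it records the count, the mean 1-based position, and a normalized second central moment of the positions. The function name, the 20-letter alphabet constant, and the ordering of the output are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet, standard 20 residues


def natural_vector(sequence, alphabet=AMINO_ACIDS):
    """Embed a sequence as [counts..., mean positions..., second moments...].

    For each letter k with n_k occurrences at 1-based positions s_i:
      mu_k = sum(s_i) / n_k
      D2_k = sum((s_i - mu_k)^2) / (n_k * n),  n = sequence length
    Letters that do not occur contribute zeros.
    """
    n = len(sequence)
    counts, means, moments = [], [], []
    for letter in alphabet:
        positions = [i + 1 for i, ch in enumerate(sequence) if ch == letter]
        n_k = len(positions)
        if n_k == 0:
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(n_k)
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


# Toy example on a two-letter alphabet to keep the vector short.
vec = natural_vector("AAC", alphabet="AC")
```

Dividing the count entries by the sequence length gives per-residue frequencies, the kind of feature the logistic regression in the abstract is described as operating on. Each protein sequence maps to a vector of fixed length (60 entries for the 20-letter alphabet), so sequences of different lengths can be compared or projected without alignment.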