Prototype-based learning for sequences in molecular biology

  • Sequences are an important data structure in molecular biology, but most machine learning algorithms cannot handle them directly because they expect vectorial data. Recent approaches rely on proximity data, for example median and relational Learning Vector Quantization, but many of them are limited in the size of the data they can handle. No standard method for generating vectorial features from sequence data exists yet, so a way is needed to make sequence data accessible to machine learning algorithms, preferably interpretable ones. This thesis therefore investigates a new approach, the Sensor Response Principle, and adapts it to protein sequences. Sequence similarity is measured via pairwise sequence alignments, using different alignment algorithms and various substitution matrices, and the resulting measurements serve as input for learning with the Generalized Learning Vector Quantization algorithm. A special focus lies on sequence length variability, which is suspected to affect the alignment score and therefore the discriminative quality of the generated feature vectors. Dedicated datasets were generated from the Pfam protein family database to address this question. The impact of the number of references and of the choice of substitution matrix is also examined.
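
The feature-generation idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes that each sequence is represented by its alignment scores against a fixed set of reference sequences, and it replaces the substitution matrices and alignment algorithms used in the thesis with a plain Needleman-Wunsch global alignment and made-up match, mismatch, and gap scores. The names `global_alignment_score` and `response_vector` are hypothetical.

```python
from typing import List, Sequence


def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions; a real setup would
    look up residue pairs in a substitution matrix such as BLOSUM62.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                          prev[j] + gap,      # gap in b
                          curr[j - 1] + gap)  # gap in a
        prev = curr
    return prev[-1]


def response_vector(seq: str, references: Sequence[str]) -> List[int]:
    """Describe one sequence by its alignment scores against all references."""
    return [global_alignment_score(seq, ref) for ref in references]


# Toy example: two reference sequences yield two-dimensional feature vectors.
references = ["ACD", "AD"]
sequences = ["ACD", "AD", "ACDE"]
features = [response_vector(s, references) for s in sequences]
```

Each row of `features` is a fixed-length vector regardless of the length of the original sequence, so it could be passed to a vector-based classifier such as Generalized Learning Vector Quantization. Because the raw score grows with sequence length in this sketch, it also shows why length variability may influence the features.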

Author: Julius Voigt
Advisor: Thomas Villmann, Marika Kaden
Document Type: Master's Thesis
Year of Completion: 2022
Granting Institution: Hochschule Mittweida
Release Date: 2023/02/07
GND Keyword: Machine learning
Page Number: 59
Institutes: Angewandte Computer- und Biowissenschaften
DDC classes: 006.31 Machine learning
Open Access: Freely accessible
Licence (German): Protected by copyright