Drought is one of the most common and dangerous threats plants have to face, costing the global agricultural sector billions of dollars every year and leading to the loss of tons of harvest. Until people drastically reduce their consumption of animal products or cellular agriculture comes of age, more and more crops will need to be produced to sustain the ever-growing human population. Even then, as more areas on Earth become prone to drought due to climate change, we may still have to find or breed plant varieties better suited to grow and prosper in these changing environments.
Plants respond to drought stress with a complex interplay of hormones, transcription factors, and many other functional or regulatory proteins, and mapping out this web of agents is no trivial task. Over the last two to three decades, machine learning has become immensely popular and is increasingly used to find patterns in situations that are too complex for the human mind to grasp. Even though much of the hype is focused on the latest developments in deep learning, relatively simple methods often yield superior results, especially when data is limited and expensive to gather.
This Master's thesis, conducted at the IPK in Gatersleben, develops an approach for shedding light on the phenotypic and transcriptomic processes that occur when a plant is subjected to stress. It centers on a random forest feature selection algorithm. Although the approach is used here to illuminate the drought stress response in Arabidopsis thaliana, it can be applied to all kinds of stresses in all kinds of plants.
There are multiple ways to gain information about an individual and their health status. One increasingly popular field in medicine is the analysis of human breath, which carries a lot of information about metabolic processes within the individual's body. The information in exhaled breath consists of volatile (organic) compounds (VOCs). These VOCs are products of metabolic processes within the body and thus might indicate diseases that disturb those processes. The compounds are detected by mass-spectrometric (MS) or ion-mobility spectrometric (IMS) techniques, so their analysis is not limited to exhaled breath. The resulting data is spectral data, which captures the concentrations of the VOCs indirectly through intensities. About 3000 VOCs [1] have already been identified in human exhaled breath. The number of research papers on VOC analysis and detection has risen almost constantly over the last decade. Furthermore, the technique used to identify VOCs could also be used to capture biomarkers of foreign species within the individual's body. VOCs can be extracted from an individual by non-invasive or minimally invasive techniques. However, manually identifying VOCs and biomarkers related to a certain disease or infection is not feasible because of the complexity of the samples and the often unknown metabolic products, so automated techniques are needed. [1–4] To establish breath analysis as a diagnostic tool, machine learning methods could be used. Machine learning has become a popular and common technique for medical data because it allows rapid analysis. With this advantage, breath analysis using machine learning could become the method of choice for diagnosis, given that conventional methods are laboratory-based and sometimes need several days to identify the organism when detecting a bacterial infection. [5]
Active learning (AL) is a training strategy in supervised machine learning that aims to improve the accuracy of a classifier by training it with only a few labeled but highly informative data points (DP). In medical research, often only a few labeled DP are available. AL can be a sensible strategy for reducing the cost and effort of labeling unlabeled DP. So far, the greatest successes have been achieved with pool-based AL. In this thesis, two biological binary classification problems were studied with uncertainty sampling pool-based AL and query-by-bagging committee pool-based AL. Generalized Learning Vector Quantization (GLVQ) and a multilayer perceptron (MLP) were used as classifiers. Using one linearly separable and one non-linearly separable dataset, the effect on accuracy of the number of labeled DP used for the initial training of the classifiers was examined. When the classifiers were initially trained with 10 % labeled DP, the AL accuracy already came close to the accuracy of classical machine learning and in some cases exceeded it. In a further experiment, the classifiers were therefore initially trained with only 1 % labeled DP, and the effect on accuracy of the number of additionally labeled DP used for retraining was examined. For the linearly separable dataset, AL was successful with the GLVQ and 10 additionally labeled DP, and with the MLP and 50 additionally labeled DP. For the non-linearly separable dataset, the MLP at least showed a tendency for AL to improve accuracy; however, 50 additionally labeled DP were not sufficient.
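The pool-based uncertainty sampling loop from this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: a nearest-class-mean classifier stands in for GLVQ or the MLP, the margin between the two closest class means serves as the uncertainty score, and the names `class_means`, `margin`, `active_learning` and the `oracle` callback (the human labeler) are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equally long tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def class_means(labeled):
    """Fit the stand-in classifier: one mean vector per class."""
    sums, counts = {}, {}
    for x, y in labeled:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}

def predict(protos, x):
    """Assign x to the class with the nearest mean."""
    return min(protos, key=lambda y: sq_dist(protos[y], x))

def margin(protos, x):
    """Gap between the two nearest class means; a small gap means high uncertainty.
    Assumes at least two classes are present in the labeled set."""
    d = sorted(sq_dist(p, x) for p in protos.values())
    return d[1] - d[0]

def active_learning(pool, oracle, seed_idx, n_queries):
    """Start from a few labeled points, then repeatedly label the most uncertain one."""
    seed = set(seed_idx)
    labeled = [(pool[i], oracle(pool[i])) for i in seed_idx]
    unlabeled = [i for i in range(len(pool)) if i not in seed]
    queried = []
    for _ in range(n_queries):
        protos = class_means(labeled)                       # retrain on current labels
        pick = min(unlabeled, key=lambda i: margin(protos, pool[i]))
        unlabeled.remove(pick)
        queried.append(pick)
        labeled.append((pool[pick], oracle(pool[pick])))    # ask the oracle for a label
    return class_means(labeled), queried
```

The seed set must contain at least one point of each class, otherwise the margin is undefined. In a real study the stand-in classifier would be replaced by the trained GLVQ or MLP and its own confidence measure.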
This thesis analyzed machine learning methods with regard to their ability to support the intralingual translation process from standard German into Leichte Sprache (easy-to-read German) texts. For this purpose, relevant methods from machine learning and natural language processing were compared, in this case statistical machine translation and neural machine translation. The comparison covered the potential range of functions, the prerequisites, and the implementability. The result of this comparison was that these methods do have the potential to support the translation process. However, because no text corpus of standard German with a corresponding corpus in Leichte Sprache exists, these methods could not be implemented. Three functions that support the translation process were implemented: a function for displaying more common synonyms of words, a function for the automatic generation of summaries, and a function for displaying rephrasings of numbers in the categories large numbers, old year dates, and percentages. The functions were evaluated with a randomly generated word list and with selected news items for the summaries and the number categories. The evaluation showed that these functions have a supporting effect but are highly error-prone.
Machine learning models for time series have always been a special topic of interest due to their unique data structure. Recently, the introduction of attention improved the capabilities of recurrent neural networks and transformers in learning tasks such as machine translation. However, these models are usually subsymbolic architectures, which makes their inner workings hard to interpret without comprehensive tools. In contrast, interpretable models such as learning vector quantization offer a more transparent decision process. This thesis tries to merge attention as a machine learning function with learning vector quantization to better handle time series data. A design for such a model is proposed and tested with a dataset used in connection with attention-based transformers. Although the proposed model did not yield the expected results, this work outlines improvements for further research on this approach.
Analysis of Continuous Learning Strategies at the Example of Replay-Based Text Classification
(2023)
Continuous learning is a research field that has grown significantly in recent years due to highly complex machine and deep learning models. Whereas static models need to be retrained entirely from scratch when new data become available, continuous models progressively adapt to new data, saving computational resources. In this context, this work analyzes parameters that affect replay-based continuous learning approaches, using a data-incremental text classification task with an MLP and an LSTM as an example. In general, replay was found to improve the results compared to naive approaches, but it does not reach the performance of a static model. Performance mainly increased with more replayed examples, and the number of training iterations has a significant influence because it can partly control the stability-plasticity trade-off. In contrast, balancing the buffer and the strategy for selecting examples to store in the replay buffer had only a minor impact on the results in the present case.
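The replay mechanism discussed in this abstract can be illustrated with a small buffer that mixes stored examples into each new batch. This is a hedged sketch, not the thesis's setup: reservoir sampling is assumed here as one common selection strategy, and `ReplayBuffer`, `training_batches` and the parameter `replay_k` (number of replayed examples per batch) are hypothetical names.

```python
import random

class ReplayBuffer:
    """Fixed-size buffer filled by reservoir sampling (an assumed selection strategy)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0                      # number of examples offered so far
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Keep each example seen so far with equal probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        """Draw up to k stored examples without replacement."""
        return self.rng.sample(self.items, min(k, len(self.items)))

def training_batches(stream, buffer, replay_k):
    """Yield each new batch extended by replay_k old examples, then store the new ones."""
    for batch in stream:
        yield list(batch) + buffer.sample(replay_k)
        for example in batch:
            buffer.add(example)
```

Each yielded batch would be passed to the model's training step. Raising `replay_k` corresponds to the "more replayed examples" setting that the abstract reports as beneficial.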
In this thesis, we focus on using machine learning to automate manual or rule-based processes for the deduplication task of the data integration process in an enterprise customer experience program. We study the underlying theoretical foundations of the most widely used machine learning algorithms, including logistic regression, random forests, extreme gradient boosting trees, support vector machines, and generalized matrix learning vector quantization. We then apply those algorithms to a real, private data set and compare their performance using standard evaluation metrics for classification: the confusion matrix, precision, recall, the area under the precision-recall curve, and the area under the Receiver Operating Characteristic curve.
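As an illustration of the evaluation metrics named above, the following sketch derives precision and recall from confusion-matrix counts for a binary duplicate/non-duplicate decision. The function names and the label encoding (1 = duplicate pair) are assumptions made for this example only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification result."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred, positive=1):
    """Precision = tp / (tp + fp), recall = tp / (tp + fn); 0.0 if undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For deduplication, where true duplicates are usually rare, precision and recall tend to be more informative than plain accuracy, which is presumably why the precision-recall curve is included among the metrics.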
As new sensors are added to VR headsets, more data can be collected. This introduces a new potential threat to user privacy. We focused on the feasibility of extracting personal information from eye-tracking data. To this end, we designed a preliminary user study focusing on the pupil response to audio stimuli. We tested a variety of machine learning models on the collected data to determine the feasibility of obtaining information such as the age or gender of the participant. Several of the experiments show promise for obtaining this information. We were able to extract with reasonable certainty whether caffeine had been consumed and the gender of the participant. This demonstrates the previously unknown threat that embedded sensors pose to users. Further studies are planned to verify the results.
Many companies use machine learning techniques to support decision-making and automate business processes by learning from the data that they have. In this thesis, we investigate the theory behind the machine learning algorithms most widely used in practice for solving classification and regression problems.
In particular, the following algorithms were chosen for the classification problem: Logistic Regression, Decision Trees, Random Forest, Support Vector Machine (SVM), and Learning Vector Quantization (LVQ). For the regression problem, Decision Trees, Random Forest, and Gradient Boosted Trees were used. We then apply those algorithms to real company data and compare their performance and results.
This thesis serves as a basis for implementing an automated classification of textual error messages. The main goal is to gain a fundamental understanding of how to approach building a machine learning system. Different types of machine learning are explained. The selection and construction of a learning model are examined from different angles to give an overview of the individual steps. To ensure a practical solution approach, initial tests with a selected learning model have already been carried out.
The goal of this Master's thesis is the evaluation of OpenPose, a real-time multi-person 2D pose estimation framework. The research question is down to which pixel size a person is generally detected and rendered correctly by the system with a confidence of more than 50%. To answer this question, a study with seven participants was conducted. The collected data show that the sought confidence value is reached at a body height between 110 px and 150 px for people in digital images.
In this paper, we conduct experiments to optimize the learning rates for the Generalized Learning Vector Quantization (GLVQ) model. Our approach leverages insights from cognitive science rooted in the profound intricacies of human thinking. Recognizing that human-like thinking has propelled humankind to its current state, we explore the applicability of cognitive science principles in enhancing machine learning. Prior research has demonstrated promising results when applying learning rate methods inspired by cognitive science to Learning Vector Quantization (LVQ) models. In this study, we extend this approach to GLVQ models. Specifically, we examine five distinct cognitive science-inspired GLVQ variants: Conditional Probability (CP), Dual Factor Heuristic (DFH), Middle Symmetry (MS), Loose Symmetry (LS), and Loose Symmetry with Rarity (LSR). Our experiments involve a comprehensive analysis of the performance of these cognitive science-derived learning rate techniques across various datasets, aiming to identify optimal settings and variants of cognitive science GLVQ model training. Through this research, we seek to unlock new avenues for enhancing the learning process in machine learning models by drawing inspiration from the rich complexities of human cognition. Keywords: machine learning, GLVQ, cognitive science, cognitive bias, learning rate optimization, optimizers, human-like learning, Conditional Probability (CP), Dual Factor Heuristic (DFH), Middle Symmetry (MS), Loose Symmetry (LS), Loose Symmetry with Rarity (LSR).
Differentiation is ubiquitous in mathematics and especially in machine learning, where gradient-based models rely on it. Calculating gradients can be complex and may require handling multiple variables. Supervised Learning Vector Quantization models, which are used for classification tasks, also use Stochastic Gradient Descent to optimize their cost functions. There are various methods for calculating these gradients or derivatives, namely Manual Differentiation, Numeric Differentiation, Symbolic Differentiation, and Automatic Differentiation. In this thesis, we evaluate each of these methods for calculating derivatives and compare their use for the variants of Generalized Learning Vector Quantization algorithms.
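To make the difference between two of these methods concrete, the sketch below differentiates the GLVQ classifier function mu = (d+ - d-) / (d+ + d-) with respect to a one-dimensional prototype, once numerically (central differences) and once automatically (forward mode with dual numbers). It is an illustration only: the 1-D setting, the `Dual` class, and the default values for the sample `x` and the wrong-class prototype `v` are assumptions, not taken from the thesis.

```python
class Dual:
    """Dual number val + dot * eps for forward-mode automatic differentiation."""

    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        o = Dual.lift(other)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__

    def __sub__(self, other):
        o = Dual.lift(other)
        return Dual(self.val - o.val, self.dot - o.dot)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        o = Dual.lift(other)
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

    def __truediv__(self, other):
        o = Dual.lift(other)
        return Dual(self.val / o.val,
                    (self.dot * o.val - self.val * o.dot) / (o.val ** 2))

def glvq_mu(w, x=1.0, v=3.0):
    """GLVQ classifier function for sample x, correct prototype w, wrong prototype v."""
    d_plus = (x - w) * (x - w)     # squared distance to the correct prototype
    d_minus = (x - v) * (x - v)    # squared distance to the wrong prototype
    return (d_plus - d_minus) / (d_plus + d_minus)

def numeric_derivative(f, w, h=1e-5):
    """Central difference approximation; accuracy depends on the step size h."""
    return (f(w + h) - f(w - h)) / (2 * h)

def automatic_derivative(f, w):
    """Exact derivative up to floating point, obtained by seeding dot = 1."""
    return f(Dual(w, 1.0)).dot
```

For `w = 0.0` the manually derived value is -16/25 = -0.64. The automatic result matches it to machine precision, while the numeric result carries a small truncation and rounding error.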
In the past few years, generative models have become an interesting topic in the field of Machine Learning (ML). The Variational Autoencoder (VAE) is one of the popular frameworks for generative models, based on the work of D. P. Kingma and M. Welling [6] [7]. As an alternative to the VAE, the authors in [12] proposed and implemented an Information Theoretic Learning (ITL) based Autoencoder. The VAE and the ITL Autoencoder combine neural networks with probabilistic graphical models (PGM) [7]. In modern statistics, approximating probability densities is difficult. In this work, we use the Variational Inference (VI) technique from machine learning, which approximates distributions through optimization. The closeness between the distributions is measured by information theoretic divergence measures such as the Kullback-Leibler, Euclidean, and Cauchy-Schwarz divergences. In this thesis, we study theoretical and experimental results for two different generative model frameworks, which generate images of MNIST handwritten characters [8] and the Yale Face Database B [3]. The results show that the proposed VAE and ITL Autoencoder are capable of generating the underlying structure of the example datasets.
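The three divergence measures named in this abstract can be written down compactly for discrete distributions. This is a hedged sketch: the thesis works with continuous densities inside autoencoders, so the discrete versions and the function names below are assumptions for illustration. Note also that some authors define the Cauchy-Schwarz divergence with a squared ratio, which differs from the form used here by a factor of two.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence sum_i p_i * log(p_i / q_i).
    Assumes q_i > 0 wherever p_i > 0; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def euclidean_divergence(p, q):
    """Squared Euclidean distance between the two probability vectors."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def cauchy_schwarz_divergence(p, q):
    """-log of the normalized inner product of p and q.
    Assumes the inner product of p and q is positive."""
    cross = sum(pi * qi for pi, qi in zip(p, q))
    norm = math.sqrt(sum(pi * pi for pi in p) * sum(qi * qi for qi in q))
    return -math.log(cross / norm)
```

All three measures are zero when the two distributions are identical. Unlike the Kullback-Leibler divergence, the Euclidean and Cauchy-Schwarz divergences are symmetric in `p` and `q`.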
In machine learning, Learning Vector Quantization (LVQ) is well known as a supervised vector quantization method. LVQ has been studied for generating optimal reference vectors because of its simple and fast learning algorithm [2]. In many classification tasks, different variants are considered when training a model, and considering large margin variants of LVQ helps to obtain significant results [20]. Large margin LVQ (LMLVQ) aims to maximize the distance between the decision hyperplane and the data points. In this thesis, a comparison of different variants of Generalized Learning Vector Quantization (GLVQ) and large margin LVQ is presented, along with visualization, implementation, and experimental results.
This thesis deals with the creation of semantic encodings of image data. To extract these encodings from the data, an artificial neural network is trained on video frame interpolation. The learned encodings are then tested for their applicability to another task in AI-based image processing, the extraction of landmarks on humans.
A relatively new research field in the neurosciences, called Connectomics, aims to achieve a full understanding and mapping of neural circuits and fine neuronal structures of the nervous system in a variety of organisms. This detailed information will provide insight into how our brain is influenced by different genetic and psychiatric diseases, how memory traces are stored, and how ageing influences our brain structure. It is beyond question that new methods for data acquisition will produce large amounts of neuronal image data. This data will exceed the zettabyte range and is impossible to annotate manually for visualization and analysis. Nowadays, machine learning algorithms, especially deep convolutional neural networks, are heavily used in medical imaging and computer vision, which opens up the opportunity to design fully automated pipelines for image analysis. This work presents a new automated workflow with three major parts: image processing using consecutive deep convolutional networks, a pixel-grouping step called connected components, and 3D visualization via neuroglancer. Together they achieve a dense three-dimensional reconstruction of neurons from EM image data.
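The pixel-grouping step mentioned above, connected components, can be sketched in a few lines. The example is a simplified assumption: it labels a 2D binary mask with 4-connectivity using breadth-first search, whereas the workflow in the abstract operates on 3D volumes, and the function name `label_components` is hypothetical.

```python
from collections import deque

def label_components(mask):
    """Label 4-connected foreground regions of a 2D binary mask.
    Returns (labels, count); background stays 0, regions are numbered from 1."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and labels[r][c] == 0:
                count += 1                          # found a new, unlabeled region
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count
```

Extending the neighbor offsets to the six face neighbors of a voxel would turn the same idea into a 3D labeling, so that each resulting label corresponds to one candidate neuron segment.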
Data streams change their statistical behaviour over time. These changes can occur gradually or abruptly and for unforeseen reasons, which may affect the expected outcome. Thus, it is important to detect concept drift as soon as it occurs. In this thesis, we chose a distance-based methodology to detect the presence of concept drift in data streams. We used generalized learning vector quantization (GLVQ) and generalized matrix learning vector quantization (GMLVQ) classifiers to calculate distances between prototypes and data points. Chi-square and Kolmogorov–Smirnov tests are used to compare the distance distributions of the test and training data sets to indicate the presence of drift.
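The distance-based drift check described here can be sketched as follows. The sketch assumes that the prototypes have already been trained (for example by GLVQ) and are given as plain tuples, uses the Euclidean distance to the nearest prototype, and compares the two distance samples with a two-sample Kolmogorov–Smirnov statistic against the usual asymptotic critical value at the 5 % level. The function names and the constant `c_alpha` are choices made for this illustration.

```python
import math

def nearest_prototype_distance(x, prototypes):
    """Euclidean distance from data point x to its closest prototype."""
    return min(math.dist(x, w) for w in prototypes)

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

def drift_detected(train_dist, test_dist, c_alpha=1.358):
    """Flag drift if the KS statistic exceeds the asymptotic critical value.
    c_alpha = 1.358 corresponds to a significance level of about 0.05."""
    n, m = len(train_dist), len(test_dist)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(train_dist, test_dist) > critical
```

In use, `nearest_prototype_distance` is applied to every training point and to every point of a new window from the stream, and `drift_detected` is called on the two resulting lists. The asymptotic critical value is only reliable for reasonably large samples.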
This thesis tests various machine learning methods and addresses the question of whether they can be used to detect suspicious logins. Of particular interest is the detection of specific abnormal login patterns that are used in the context of an attack. These can then be used to identify attackers or compromised users in a network. The difficulty of detecting such suspicious logins increases with the number of attacks. In addition, the variety of behaviors affects detection. Accordingly, different methods are tested, several scenarios are simulated, and the methods and the procedure are then validated on a real test case. The final outcome of the thesis is a piece of software and a procedure for detecting suspicious logins.
Embeddings for Product Data
(2022)
The e-commerce industry has grown exponentially in the last decade, with giants like Amazon, eBay, Aliexpress, and Walmart selling billions of products. Machine learning techniques can be used within the e-commerce domain to improve the overall customer journey on a platform and increase sales. Product data in particular can be used for various applications, such as product similarity, clustering, recommendation, and price estimation. To use product data for such applications, we have to perform feature engineering. The idea is to transform the products into feature vectors before training a machine learning model on them. In this thesis, we propose an approach for creating representations of heterogeneous product data from Unite’s platform, which comes in the form of structured tabular records. These tables consist of attributes carrying different kinds of information, ranging from product IDs to long descriptions. Our model combines popular deep learning approaches from natural language processing to create a numerical representation for every product. Because these arrays or matrices contain mostly non-zero elements, they are called dense representations. To evaluate the quality of these feature vectors, we validate how well the similarities between products are captured by the dense representations. The evaluations are divided into two categories. The first category directly compares the similarities between individual products. The second category uses the dense vectors as inputs to one of the above-mentioned applications and evaluates their quality based on the accuracy or performance of that application. As a result, we explain the impact of the different steps within our model on the quality of the learned representations.
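The first evaluation category, comparing similarities between individual products directly, could look like the sketch below. Cosine similarity is assumed here as the similarity measure, which the abstract does not specify, and the function names and the dictionary layout (product ID mapped to its dense vector) are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(query_id, embeddings, k=3):
    """Return the ids of the k products whose vectors are closest to the query."""
    query = embeddings[query_id]
    scored = [(cosine_similarity(query, vec), pid)
              for pid, vec in embeddings.items() if pid != query_id]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:k]]
```

A qualitative check would then inspect whether the nearest neighbors of a product are items a domain expert would also consider similar.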