The automatic comparison of RNA/DNA sequences, or more generally nucleotide sequences, is a complex task that requires careful design because of its computational complexity. While alignment-based models suffer from high computational time costs, alignment-free models have to deal with appropriate data preprocessing and consistently designed mathematical data comparison. This work deals with the latter strategy. In particular, a systematic categorization is proposed, which emphasizes two key concepts that have to be combined for a successful comparison analysis: 1) the data transformation, comprising adequate mathematical sequence coding and feature extraction, and 2) the subsequent (dis-)similarity evaluation of the transformed data by means of problem-specific but mathematically consistent proximity measures. Respective approaches from different categories of the introduced scheme are examined with regard to their suitability for distinguishing natural RNA virus sequences from artificially generated ones that preserve biological features to varying degrees. The challenge in this application is the limited additional biological information available, such that the decision has to be made solely on the basis of the sequences and their inherent structural characteristics. To address this, the present work focuses on interpretable, dissimilarity-based classification models from machine learning, namely variants of Learning Vector Quantizers. These methods are known to be robust and highly interpretable, and therefore allow the applied data transformations to be evaluated together with the chosen proximity measure with respect to the given discrimination task. Initial analysis results are provided and discussed, serving as a starting point for a more in-depth analysis of this problem in the future.
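The two key concepts named in this abstract, a data transformation followed by a proximity measure, can be illustrated with a minimal sketch. The abstract does not state which sequence coding, proximity measure, or Learning Vector Quantizer variant the thesis uses, so everything below is an assumption chosen for illustration: relative k-mer frequencies as the transformation, the Euclidean distance as the dissimilarity, and a single update step of the basic LVQ1 rule. The function names `kmer_profile`, `euclidean`, and `lvq1_step` are hypothetical.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGU"  # assumption: RNA alphabet; T is mapped to U below


def kmer_profile(seq, k=2):
    """Relative k-mer frequencies of seq, in fixed lexicographic k-mer order."""
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # windows with ambiguous symbols (e.g. N) are skipped
            counts[window] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def euclidean(u, v):
    """Euclidean distance between two equally long feature vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def lvq1_step(prototype, proto_label, x, x_label, lr=0.1):
    """One LVQ1 update of the winning prototype: attract it to x if the
    labels agree, repel it otherwise."""
    sign = 1.0 if proto_label == x_label else -1.0
    return [w + sign * lr * (xi - w) for w, xi in zip(prototype, x)]


# Illustrative sequences only, not data from the thesis.
natural = kmer_profile("AUGGCUAUGGCUAAGC")
artificial = kmer_profile("AAAAAAAACCCCCCCC")
print(euclidean(natural, artificial))
```

Because the prototypes live in the same feature space as the transformed sequences, they can be inspected directly, which is the kind of interpretability the abstract refers to.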
In this work, the task is to cluster microarray gene expression data of the cyanobacterium Nostoc PCC 7120 in order to detect messenger RNA (mRNA) degradation patterns. The aim is to find characteristic degradation patterns caused by specific enzymes (ribonucleases), which would allow further biological investigation of the underlying biochemical mechanisms. mRNA degradation is part of the regulation of gene expression because it regulates the amount and longevity of the mRNA that is available for translation into proteins. Exoribonucleases are a particular class of RNA-degrading enzymes that degrade the molecule from its ends; degradation from the 5’ end, the 3’ end, or both ends is theoretically possible.
In this investigation, the information about exoribonucleolytic degradation is given in a microarray data set containing gene expression values of 1,251 genes. The data set provides gene expression vectors containing the expression values of up to ten short, distinct sections of a gene, ordered from the gene's 5’ end to its 3’ end. For each gene, expression vectors are available for both nitrogen-fixing and non-nitrogen-fixing conditions, which have to be considered separately for biological reasons. Accordingly, after filtering and preprocessing, two data sets for clustering are obtained, consisting of 133 ten-dimensional expression vectors. The similarity of the expression vectors is judged by a new correlation-based similarity measure and compared with the results obtained using the Euclidean distance. A non-linear transformation of the correlations was applied to obtain a dissimilarity measure. By choosing the parameters of this transformation, the user can control how negatively and positively correlated gene expression vectors are differentiated and can adjust adequately for the noise level of the gene expression values.
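The idea of turning correlations into a parameterized dissimilarity can be sketched as follows. The abstract does not give the actual transformation, so the logistic form below and its parameters `beta` and `theta` are purely illustrative assumptions, not the measure used in the thesis. In this sketch, `theta` shifts the correlation value at which two profiles start to count as similar, and `beta` controls how sharply the transition happens, which is one possible way to account for noise.

```python
from math import exp, sqrt


def pearson(u, v):
    """Pearson correlation of two equally long expression vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return cov / sqrt(var_u * var_v)


def correlation_dissimilarity(u, v, beta=5.0, theta=0.0):
    """Hypothetical non-linear map from correlation r in [-1, 1] to a
    dissimilarity in (0, 1): close to 0 for r well above theta and
    close to 1 for r well below theta."""
    r = pearson(u, v)
    return 1.0 / (1.0 + exp(beta * (r - theta)))


# Illustrative profiles: expression decreasing towards the 3' end,
# increasing towards the 3' end, and flat.
falling = [4.0, 3.0, 2.0, 1.0]
rising = [1.0, 2.0, 3.0, 4.0]
flat = [1.0, 1.0, 1.0, 1.0]
print(correlation_dissimilarity(rising, rising))
print(correlation_dissimilarity(rising, falling))
```

Unlike the Euclidean distance, a correlation-based measure of this kind compares the shape of two profiles along the gene rather than their absolute expression levels.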
Clustering was performed using Affinity Propagation (AP). The number of clusters obtained by AP depends on the so-called self-similarity of the data vectors. This dependence was used to identify stable cluster solutions by controlling the self-similarity. To evaluate the clustering results, Median Fuzzy c-Means (M-FCM) was used. Further, several cluster validity measures were applied, and visual inspection by t-distributed Stochastic Neighbor Embedding (t-SNE) as well as cluster visualizations are provided for a mathematical interpretation of the clusters.
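How the self-similarity steers the number of clusters can be seen in a compact sketch of the standard AP message-passing updates (responsibilities and availabilities with damping). This is a plain-Python illustration under stated assumptions, not the implementation used in the thesis: the toy points, the negative squared distance as similarity, the damping factor, and the fixed iteration count are all chosen here for demonstration, and the function name `affinity_propagation` is hypothetical.

```python
def affinity_propagation(S, preference, damping=0.5, iterations=200):
    """Return the exemplar indices found by affinity propagation.

    S is a square similarity matrix (list of lists, at least 2 x 2).
    Its diagonal is replaced by `preference`, the self-similarity.
    """
    n = len(S)
    S = [[preference if i == k else S[i][k] for k in range(n)] for i in range(n)]
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # r(i, k) = s(i, k) - max over k' != k of [a(i, k') + s(i, k')]
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                best_other = max(scores[j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)
        # a(i, k) = min(0, r(k, k) + sum over i' not in {i, k} of max(0, r(i', k)))
        # a(k, k) = sum over i' != k of max(0, r(i', k))
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]
            for i in range(n):
                new = total if i == k else min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    # A point is an exemplar if its combined self-evidence is positive.
    return [k for k in range(n) if R[k][k] + A[k][k] > 0]


# Toy data: two well-separated groups on a line (illustrative only).
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
S = [[-(a - b) ** 2 for b in points] for a in points]

# Sweep the self-similarity and record how many exemplars result.
for preference in (-50.0, -10.0, 1.0):
    exemplars = affinity_propagation(S, preference)
    print(preference, len(exemplars), exemplars)
```

A self-similarity above every pairwise similarity makes each point its own exemplar, while lower values generally yield fewer clusters. Ranges of the self-similarity over which the number of exemplars does not change are one way to read the "stable cluster solutions" mentioned above.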
To validate the clustering results biologically, the data structure found is checked for biological adequacy. A deeper investigation into the mechanisms behind mRNA degradation was carried out using an RNA-Seq data set. The 40 base pair (bp) reads it contains for non-nitrogen-fixing and nitrogen-fixing conditions were assembled using the bacteria-specific ab initio assembly of Rockhopper. In this way, the mRNA (transcript) sequences of the clustered genes are obtained. The untranslated regions (UTRs) are investigated further, based on the assumption that exoribonucleases recognize specific transcript sequences outside the annotated gene regions as their binding sites. These UTRs need to be analyzed for sequence similarity using motif-finding algorithms.