Establishing a Workflow for Detection of mRNA-Degradation Patterns based on Cluster Analysis Using a Novel Gene-Expression Correlation Similarity Measure

Bohnsack, Katrin Sophie

In this work, microarray gene expression data of the cyanobacterium Nostoc PCC 7120 are clustered to detect messenger RNA (mRNA) degradation patterns. The aim is to find characteristic degradation patterns caused by specific enzymes (ribonucleases), which allows further biological investigation of the underlying biochemical mechanisms. mRNA degradation is part of the regulation of gene expression because it controls the amount and longevity of the mRNA available for translation into proteins. A particular class of RNA-degrading enzymes are the exoribonucleases, which degrade the molecule from its ends; degradation from the 5’ end, from the 3’ end, or from both ends is theoretically possible. In this investigation, the information about exoribonucleolytic degradation is given in a microarray data set containing gene expression values of 1,251 genes. The data set provides gene expression vectors containing the expression values of up to ten short, distinct sections of a gene, ordered from the gene's 5’ end to its 3’ end. For each gene, expression vectors are available for both nitrogen-fixing and non-nitrogen-fixing conditions, which have to be considered separately for biological reasons. Accordingly, after filtering and preprocessing, two data sets of 133 ten-dimensional expression vectors are obtained for clustering. The similarity of the expression vectors is judged by a novel correlation-based similarity measure, and the results are compared with those obtained using the Euclidean distance. A non-linear transformation of the correlations is applied to obtain a dissimilarity measure. The choice of parameters within this transformation allows a user-specific differentiation between negatively and positively correlated gene expression vectors and an adequate adjustment to the noise level of the gene expression values. Clustering was performed using Affinity Propagation (AP).
The number of clusters obtained by AP depends on the so-called self-similarity of the data vectors. This dependence was used to identify stable cluster solutions by controlling the self-similarity. To evaluate the clustering results, Median Fuzzy c-Means (M-FCM) was used. Further, several cluster validity measures are applied, and visual inspection by t-distributed Stochastic Neighbor Embedding (t-SNE) as well as cluster visualizations are provided for a mathematical interpretation of the clusters. To validate the clustering results biologically, the data structure found is checked for biological adequacy. A deeper investigation into the mechanisms behind mRNA degradation was carried out using an RNA-Seq data set. The 40 base pair (bp) long reads it contains for non-nitrogen-fixing and nitrogen-fixing conditions were assembled using the bacteria-specific ab initio assembly of Rockhopper. In this way, the mRNA (transcript) sequences of the clustered genes are obtained. The untranslated regions (UTRs) are investigated further, based on the assumption that exoribonucleases recognize specific transcript sequences outside of the annotated gene regions as their binding sites. These UTRs need to be analyzed for sequence similarity using motif-finding algorithms.
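
The abstract does not state the exact form of the non-linear transformation or the criterion for a stable cluster solution, so the following Python sketch is only an illustration under stated assumptions. The function `correlation_dissimilarity`, its parameters `beta` and `signed`, and the plateau criterion in `longest_plateau` are hypothetical stand-ins and not the definitions used in the thesis. The sketch shows how one parameter could decide whether negatively correlated expression vectors count as dissimilar, how another could damp noise-induced differences between highly correlated vectors, and how a sweep over self-similarity values could be scanned for the longest run with an unchanged number of clusters.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("constant vector has no defined correlation")
    return cov / (sx * sy)


def correlation_dissimilarity(x, y, beta=2.0, signed=True):
    """Hypothetical non-linear map from a correlation r to a value in [0, 1].

    signed=True:  r = +1 gives 0 and r = -1 gives 1, so negatively
                  correlated vectors are treated as maximally dissimilar.
    signed=False: |r| is used, so r = +1 and r = -1 both give 0.
    beta > 1 flattens the curve near perfect correlation, so small
    noise-induced drops in r change the dissimilarity only slightly.
    """
    r = pearson(x, y)
    base = (1.0 - r) / 2.0 if signed else 1.0 - abs(r)
    base = min(1.0, max(0.0, base))  # guard against floating-point overshoot
    return base ** beta


def longest_plateau(counts):
    """Return (cluster_count, first_index, last_index) of the longest run of
    equal cluster counts in a sweep ordered by self-similarity.

    Assumed stability criterion: a cluster number that persists over the
    widest range of self-similarity values is taken as the stable solution.
    """
    if not counts:
        raise ValueError("empty sweep")
    best = (counts[0], 0, 0)
    start = 0
    for i in range(1, len(counts) + 1):
        if i == len(counts) or counts[i] != counts[start]:
            if (i - 1) - start > best[2] - best[1]:
                best = (counts[start], start, i - 1)
            start = i
    return best
```

In such a setup, the negated pairwise dissimilarities would form the off-diagonal entries of the similarity matrix passed to an AP implementation, while the diagonal (the self-similarity) would be varied over a range of values and the resulting cluster counts passed to `longest_plateau`.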

Author:	Katrin Sophie Bohnsack
Advisor:	Röbbe Wünschiers, Thomas Villmann
Document Type:	Bachelor Thesis
Language:	English
Year of Completion:	2018
Granting Institution:	Hochschule Mittweida
Release Date:	2019/12/03
GND Keyword:	Cluster analysis (Cluster-Analyse); messenger RNA (Messenger-RNS)
Institutes:	Angewandte Computer‐ und Biowissenschaften
DDC classes:	519.53 Data analysis, cluster analysis
Open Access:	Freely accessible
Licence (German):	Protected by copyright

Entwicklung eines Arbeitsablaufs zur Detektion von mRNA-Degradierungs-Mustern basierend auf einer Clusteranalyse unter Verwendung eines neuen Korrelations-Ähnlichkeitsmaßes für Gen-Expressions-Daten
