|Abstract: ||Recent experimental evidence has shown that RNA (ribonucleic acid) plays a greater role in various cellular functions than previously thought. As the number of known RNA families increases, a need arises for computational techniques to analyze RNA sequences. A set of evolutionarily related RNA sequences, believed to contain signals at both the sequence and structure levels, can be exploited to detect motifs common to all or a portion of those sequences. Finding these shared structural features can provide substantial information as to which parts of the sequence are functional.
Recently, Nguyen (M.A.Sc. thesis, Electrical Engineering, University of Ottawa, 2004) introduced a novel approach for discovering consensus secondary structure motifs in a set of unaligned RNA sequences. The algorithm has been implemented in a software system called Seed. The aim of this thesis is to devise, implement and evaluate three scoring schemes for the software system. The first scoring scheme is based on the sum of the thermodynamic free energy, computed with the nearest-neighbor model. The second is a general framework for evaluating RNA structures using statistical regression analysis. The third is based on the minimum description length principle.
We implemented and validated these scoring schemes on four data sets of varying complexity. The first two were derived from selected members of the UTRdb database, where the coding region is flanked by two untranslated regions (5' UTR and 3' UTR). The other two were assembled from a subset of the sequences from Masoumi and Turcotte (IJBRA, 1(2), 230--245, 2005). By three measures, positive predictive value (PPV), sensitivity and Matthews correlation coefficient, our methods performed well on the data sets and showed significant ranking statistics. Our first method also compares favorably with the state-of-the-art tool RNAprofile. For small motifs, the scoring methods are able to rank motifs with high PPV and sensitivity, often 100%. The top-ranked motifs were used as input constraints for MFOLD, a widely used tool for RNA secondary structure determination; with these constraints, both the PPV and the sensitivity of the resulting foldings improved.|