|Abstract: ||RNAs are involved in many facets of biological processes, including but not limited to controlling and inhibiting gene expression, enabling transcription and translation from DNA to proteins, diseases such as cancer, and virus-host interactions. As such, studies and analyses of RNAs have useful applications, such as detecting cancer by measuring the abundance of specific RNAs, detecting and identifying infections involving RNA viruses, identifying the origins of and relationships between RNA viruses, and identifying potential targets when designing novel drugs.
Extracting sequences from RNA samples is usually no longer a major limitation, thanks to sequencing technologies such as RNA-Seq. However, accurately identifying and analyzing the extracted sequences often remains the bottleneck in developing RNA-based applications.
Like proteins, functional RNAs fold into complex structures in order to perform specific functions throughout their lifecycle. This suggests that structural information, in addition to the sequence itself, can be used to identify or classify RNA sequences. Furthermore, a strand of RNA may be able to fold into more than one structural conformation, and it may form different structures in vivo and in vitro. However, past studies that used secondary structure information for RNA identification have relied on a single predicted secondary structure for each RNA sequence, despite the possible one-to-many relationship between a strand of RNA and its secondary structures. Therefore, we hypothesized that a representation that includes the multiple possible secondary structures of an RNA may improve classification performance.
We proposed and built a pipeline that, given an RNA sequence, produces secondary structure fingerprints that take into account the multiple possible secondary structures of that RNA. Using this pipeline, we explored and developed different types of secondary structure fingerprints. One type serves as a high-level topological representation of the RNA structure, while another represents matches with common known RNA secondary structure motifs that we curated from databases and the literature. To test our hypothesis, we then used the different fingerprints with deep learning on different datasets, alone and together with various sequence-based features, to investigate how the secondary structure fingerprints affect classification performance.
Finally, by analyzing our findings, we also propose approaches that can be adopted by future studies to further improve our secondary structure fingerprints and classification performance.|
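The idea of a fingerprint built from several candidate secondary structures, rather than a single prediction, can be illustrated with a minimal sketch. This is not the pipeline described in the abstract: the input is assumed to be a list of dot-bracket strings for one RNA (obtained elsewhere, for example from a folding tool), and the three features (hairpin count, stem count, paired fraction) and the function names are hypothetical choices made only for illustration.

```python
from statistics import mean


def pair_table(structure: str) -> dict[int, int]:
    """Map each paired position to its partner in a dot-bracket string."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()
            pairs[i], pairs[j] = j, i
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r}")
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def structure_features(structure: str) -> list[float]:
    """Hypothetical topological features for ONE candidate structure."""
    pairs = pair_table(structure)
    n = len(structure)
    # A hairpin is closed by a pair (i, j) with no paired base in between.
    hairpins = sum(
        1 for i, j in pairs.items()
        if i < j and all(k not in pairs for k in range(i + 1, j))
    )
    # A pair (i, j) starts a new stem unless it stacks on the pair (i-1, j+1).
    stems = sum(
        1 for i, j in pairs.items()
        if i < j and pairs.get(i - 1) != j + 1
    )
    paired_fraction = len(pairs) / n if n else 0.0
    return [float(hairpins), float(stems), paired_fraction]


def fingerprint(structures: list[str]) -> list[float]:
    """Average the per-structure features over all candidate structures."""
    per_structure = [structure_features(s) for s in structures]
    return [mean(column) for column in zip(*per_structure)]


# Two assumed candidate structures for the same 16-nt sequence.
candidates = ["((...))..((...))", "((((....))))...."]
print(fingerprint(candidates))
```

Averaging is only one way to combine the candidates. A weighted combination, for instance by predicted stability, would be another option under the same assumptions.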