|Abstract: ||In the past decade, rapid advances have been made in the sequencing, digitization, and collection of biological data. However, the bottleneck remains at the point of analysis and extraction of patterns from the data. We have developed a method aimed at widening this bottleneck by automating knowledge extraction from biological data. Our approach discovers patterns in a set of DNA sequences based on the locations of transcription factor binding sites or any other biological markers, with an emphasis on discovering relationships. A variety of statistical and computational methods exist to analyze such data. However, they either require an initial hypothesis, which is later tested, or classify the data based on its attributes. Our approach does not require an initial hypothesis, and the classification it produces is based on the relationships between attributes. The value of such an approach is that it is able to uncover new knowledge about the data by inducing a general theory from basic known rules.
The core of our approach is an inductive logic programming (ILP) engine which, given positive and negative examples as well as background knowledge, induces a descriptive, human-readable theory describing the data. The application provides an end-to-end analysis of DNA sequences. An easy-to-use Web interface accepts a set of related sequences to be analyzed, an optional set of negative example sequences to contrast with the main set, and a set of possible genetic markers as position-specific scoring matrices. A Java-based backend formats the sequences, determines the locations of the genetic markers within them, and passes the information to the ILP engine, which induces the theory.
The model assumed in our background knowledge is a set of basic interactions between biological markers in any DNA sequence. This makes our approach applicable to a wide variety of biological problems, including the detection of cis-regulatory modules and the analysis of ChIP-Sequencing experiments. We have evaluated our method in the context of such applications on two real-world datasets as well as a number of specially designed synthetic datasets. The approach has been shown to have merit even in situations where no significant classification could be determined.|
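
The marker-location step described in the abstract can be illustrated with a minimal sketch. It is not the authors' Java backend: it is written in Python for brevity, and the matrix values, the threshold, the function names `scan` and `to_facts`, and the `site/3` fact format are all hypothetical choices made for illustration. It assumes uppercase sequences over A, C, G, T and a position-specific scoring matrix given as one score table per motif position.

```python
# Hypothetical 4-column scoring matrix for a TATA-like marker.
# Each column maps a base to an illustrative integer score.
TATA_PSSM = [
    {"A": -1, "C": -1, "G": -1, "T": 2},
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": -1, "T": 2},
    {"A": 2, "C": -1, "G": -1, "T": -1},
]


def scan(sequence, pssm, threshold):
    """Slide the matrix along the sequence and return (start, score)
    for every window whose summed score reaches the threshold."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(pssm, window))
        if score >= threshold:
            hits.append((start, score))
    return hits


def to_facts(seq_id, marker, hits):
    """Render hits as Prolog-style facts (assumed format) that an
    ILP engine could take as input describing marker locations."""
    return [f"site({seq_id}, {marker}, {start})." for start, _ in hits]


hits = scan("GGTATAGC", TATA_PSSM, threshold=6)
facts = to_facts("seq1", "tata", hits)
print(facts)
```

Facts of this kind, produced for both the main and the negative example sequences, are the sort of input from which relationships between markers (for example, relative order or distance) could then be induced.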