The main goal of this thesis is to develop, implement, and evaluate an algorithm for mining association rules from datasets that contain quantified distance information among the items. This is accomplished by extending the Apriori algorithm, the standard algorithm for mining association rules. Apriori cannot mine association rules that take into account the distances among the items that make up the rules. This thesis strengthens the main Apriori property by requiring that itemsets forming rules "deviate properly" in addition to satisfying the minimum support threshold. We say that an itemset deviates properly if all pairwise distances among its items are highly conserved across the sequences of the dataset in which these items occur. This thesis introduces the notion of proper deviation and provides the precise procedure and measures that characterize it. Integrating the notions of distance-preserving frequent itemsets and proper deviation into the standard Apriori algorithm yields our Distance-based Association Rule Mining (DARM) algorithm.
DARM can be applied to data mining and knowledge discovery in genetic, financial, retail, and time-sequence data, or in any domain in which distance information between items is important. This thesis uses gene expression and regulation in eukaryotic organisms as its application domain. Data from this domain are used to produce DARM rules, and sets of those rules are used to build predictive models. The accuracy of those models is tested, and the predictive accuracies of models built with and without distance information are compared.
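To make the idea of proper deviation concrete, the following is a minimal sketch, not the thesis's actual procedure. It assumes that each sequence records a single position per item and that "highly conserved" means the population standard deviation of each pairwise distance stays at or below a threshold. The names `SEQUENCES`, `supporting_sequences`, `deviates_properly`, `is_distance_preserving_frequent`, and the parameters `min_support` and `max_std` are hypothetical; the precise measures used by DARM are defined in the thesis itself.

```python
from itertools import combinations
from statistics import pstdev

# Hypothetical toy dataset: each sequence maps an item to its position.
# Assumption: every item occurs at most once per sequence.
SEQUENCES = [
    {"A": 10, "B": 25, "C": 40},
    {"A": 5, "B": 20, "C": 90},
    {"A": 30, "B": 46},
    {"B": 7, "C": 12},
]


def supporting_sequences(itemset, sequences):
    """Return the sequences that contain every item of the itemset."""
    return [seq for seq in sequences if all(item in seq for item in itemset)]


def deviates_properly(itemset, sequences, max_std):
    """Assumed criterion: for every pair of items, the distance between them
    has a population standard deviation of at most max_std across the
    sequences that contain the whole itemset."""
    rows = supporting_sequences(itemset, sequences)
    if not rows:
        return False
    for a, b in combinations(sorted(itemset), 2):
        distances = [abs(seq[a] - seq[b]) for seq in rows]
        if pstdev(distances) > max_std:
            return False
    return True


def is_distance_preserving_frequent(itemset, sequences, min_support, max_std):
    """Combine the usual Apriori support test with the distance test."""
    support = len(supporting_sequences(itemset, sequences)) / len(sequences)
    return support >= min_support and deviates_properly(itemset, sequences, max_std)


if __name__ == "__main__":
    for candidate in ({"A", "B"}, {"A", "C"}):
        ok = is_distance_preserving_frequent(candidate, SEQUENCES, 0.5, 2.0)
        print(sorted(candidate), ok)
```

In this toy data, both {A, B} and {A, C} meet a support threshold of 0.5, but only {A, B} keeps a nearly constant distance between its items (15, 15, and 16 positions), so only {A, B} would survive the additional check. A standard Apriori pass would keep both.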
This project created a computational biology tool to discover regulatory regions in DNA and their impact on gene expression. The MAGE computer system, which comprises three software components, predicts gene expression in cells. Motifs, or patterns, in the promoter regions of the C. elegans and C. briggsae genomes were found using expectation maximization and Gibbs algorithms. These motifs were mined and used to build association rule models that predict cell-specific expression. The accuracy of the various models was assessed.
This project created a computational biology tool to discover regulatory regions in DNA and their influence on gene expression. First, motifs in the C. elegans genome were found using an expectation maximization algorithm. Then, association rules were mined to correlate the presence of motifs with cell-type expression. A software system, CAGE, was created to perform this process. CAGE generated various types of models that pinpoint the biological significance of the motifs. The predictive accuracy of the models was assessed.
One unanswered question in biology is how the expression of genes is controlled. Using computational tools to discover patterns in DNA, the project team designed and implemented MEBCS, a data mining software package that can classify the expression of a novel promoter sequence. MEBCS performs multipoint analysis on the putative motifs it discovers and constructs predictive models using prior expression information. These models were tested on a C. elegans dataset collected by the project team.
In this project, we use two data mining tools, CBA and WEKA, to search for genetic polymorphisms that affect two human diseases, autosomal recessive spinal muscular atrophy (SMA) and bilateral optic neuropathy (LHON). We modified WEKA to be able to mine genetic data. Our results provide a number of association rules, which identify important genetic variants and phenotypic factors affecting the disease state and severity.