WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

Knowledge Discovery and Data Mining Research Group 
KDDRG

Research Projects on
Data Mining for Genetic Analysis 


------------------------------------------
GENERAL DESCRIPTION

This collection of collaborative research projects between the Computer Science and Biology and Biotechnology departments involves data collection and methodology development for the analysis of genetic information. Due to the success of many of the international genome projects, a large amount of DNA sequence data for both humans and model organisms is now publicly available. However, there are few automated techniques for analyzing these databases. The main purpose of this collection of projects is to develop new algorithms and tools for analysis of existing genomic data.

Read more about our interdisciplinary bioinformatics research group

------------------------------------------
DISTANCE-BASED ASSOCIATION RULE MINING

Project Members

Project Description

The main goal of this thesis work is to develop, implement, and evaluate an algorithm for mining association rules from datasets that contain quantified distance information among the items. This is accomplished by extending the Apriori algorithm, the standard algorithm for mining association rules. Apriori on its own cannot mine rules that take into account the distances among the items that make up the rules. This thesis strengthens the main Apriori property by requiring the itemsets that form rules to "deviate properly" in addition to satisfying the minimum support threshold. We say that an itemset deviates properly if the distances between every pair of its items are highly conserved across the sequences of the dataset in which these items occur. This thesis introduces the notion of proper deviation and provides the precise procedure and measures that characterize it. Integrating the notions of distance-preserving frequent itemsets and proper deviation into the standard Apriori algorithm yields our Distance-based Association Rule Mining (DARM) algorithm.
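The idea of combining a support test with a distance-conservation test can be illustrated with a minimal Python sketch. It is not the DARM implementation: the description above does not give the precise measure of proper deviation, so the sketch assumes one, namely that the coefficient of variation of every pairwise distance stays below a hypothetical threshold `max_cv`. It also assumes that each sequence is a dictionary mapping an item to a single position, and it prunes candidates level by level as if proper deviation behaved like the Apriori property. All function and parameter names are illustrative.

```python
from itertools import combinations
from statistics import mean, pstdev


def support(itemset, sequences):
    """Return (fraction of sequences containing every item, those sequences)."""
    hits = [s for s in sequences if all(item in s for item in itemset)]
    return len(hits) / len(sequences), hits


def deviates_properly(itemset, hits, max_cv=0.25):
    """ASSUMED criterion: each pairwise distance has a coefficient of
    variation of at most max_cv across the supporting sequences."""
    for a, b in combinations(sorted(itemset), 2):
        dists = [abs(s[a] - s[b]) for s in hits]
        avg = mean(dists)
        if avg == 0:
            continue  # all distances are zero, so they are perfectly conserved
        if pstdev(dists) / avg > max_cv:
            return False
    return True


def mine_distance_itemsets(sequences, min_support=0.5, max_cv=0.25):
    """Level-wise search that keeps itemsets passing both tests."""
    items = sorted({item for s in sequences for item in s})
    current = [frozenset([i]) for i in items]
    result = {}
    while current:
        kept = []
        for itemset in current:
            sup, hits = support(itemset, sequences)
            if sup >= min_support and deviates_properly(itemset, hits, max_cv):
                result[itemset] = sup
                kept.append(itemset)
        # Join surviving k-itemsets into (k+1)-item candidates.
        size = len(kept[0]) + 1 if kept else 0
        current = sorted(
            {a | b for a, b in combinations(kept, 2) if len(a | b) == size},
            key=sorted,
        )
    return result
```

With toy data in which items A and B always lie about 20 positions apart while C moves freely, the sketch keeps {A, B} and rejects {A, C} and {B, C}, even though all three pairs meet the support threshold.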

DARM can be applied to data mining and knowledge discovery in genetic, financial, retail, and time-sequence data, or in any domain in which distance information between items is important. This thesis uses gene expression and regulation in eukaryotic organisms as its application domain. Data from this domain are used to produce DARM rules, and sets of those rules are used to build predictive models. The accuracy of these models is tested, and the predictive accuracies of models built with and without distance information are compared.

------------------------------------------
MOTIF ANALYSIS OF GENE EXPRESSION

Project Members

Project Description

This project created a computational biology tool to discover regulatory regions in DNA and their impact on gene expression. The MAGE computer system, which consists of three software components, predicts gene expression in cells. Motifs, or patterns, in the promoter regions of the C. elegans and C. briggsae genomes were found using expectation maximization and Gibbs sampling algorithms. These motifs were mined and used to build association rule models that predict cell-specific expression. The accuracy of various models was assessed.

------------------------------------------
COMPUTATIONAL ANALYSIS OF GENE EXPRESSION

Project Members

Project Description

This project created a computational biology tool to discover regulatory regions in DNA and their influence on gene expression. First, motifs in the C. elegans genome were found using an expectation maximization algorithm. Then association rules were mined to correlate the presence of motifs with cell-type expression. A software system, CAGE, was created to perform this process. CAGE generated various types of models that pinpoint the biological significance of the motifs. The predictive accuracy of the models was assessed.
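The rule-mining step, which correlates motif presence with cell-type expression, can be sketched as follows. This is an illustration, not the CAGE code. It assumes that motif discovery has already produced, for each promoter, a set of motif names together with an expression label. It also enumerates candidate antecedents by brute force rather than with Apriori-style pruning. The function name, the thresholds, and the data layout are hypothetical.

```python
from itertools import combinations


def mine_rules(promoters, min_support=0.3, min_confidence=0.7, max_motifs=2):
    """Mine rules of the form {motifs} -> expression label.

    promoters: list of (set_of_motif_names, expression_label) pairs.
    Returns a list of (antecedent, label, support, confidence) tuples.
    """
    n = len(promoters)
    motifs = sorted({m for found, _ in promoters for m in found})
    labels = sorted({label for _, label in promoters})
    rules = []
    for k in range(1, max_motifs + 1):
        for antecedent in combinations(motifs, k):
            wanted = set(antecedent)
            # Labels of all promoters that contain every motif in the antecedent.
            covered = [label for found, label in promoters if wanted <= found]
            if not covered:
                continue
            for label in labels:
                both = covered.count(label)
                sup = both / n                # P(antecedent and label)
                conf = both / len(covered)    # P(label | antecedent)
                if sup >= min_support and conf >= min_confidence:
                    rules.append((antecedent, label, sup, conf))
    return rules
```

Support measures how often the motifs and the expression label occur together in the whole dataset, while confidence measures how reliably the label follows when the motifs are present. A rule is reported only if it clears both thresholds.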

------------------------------------------
MOTIF- AND EXPRESSION-BASED CLASSIFICATION OF DNA

Project Members

Project Description

One unanswered question in biology is how the expression of genes is controlled. Using computational tools to discover patterns in DNA, the project team designed and implemented MEBCS, a data mining software package that can classify the expression of a novel promoter sequence. MEBCS performs multipoint analysis on the putative motifs that the system discovers and constructs predictive models using prior expression information. These models were tested on a C. elegans dataset collected by the project team.
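How a rule-based model might classify a novel promoter sequence can be sketched in a few lines of Python. This is only an assumed scheme, not the MEBCS method: motifs are matched as exact substrings, and the prediction comes from the single matching rule with the highest confidence. The function name, the rule format `(motif_names, label, confidence)`, and the fallback label are all hypothetical.

```python
def classify_promoter(sequence, motif_patterns, rules, default="unknown"):
    """Predict an expression label for a promoter sequence.

    motif_patterns: dict mapping a motif name to an exact DNA substring.
    rules: list of (motif_names, label, confidence) tuples.
    """
    # Motifs whose pattern occurs somewhere in the sequence.
    present = {name for name, pattern in motif_patterns.items()
               if pattern in sequence}
    # Rules whose motifs are all present in this promoter.
    matching = [rule for rule in rules if set(rule[0]) <= present]
    if not matching:
        return default
    # ASSUMPTION: the most confident matching rule decides the label.
    best = max(matching, key=lambda rule: rule[2])
    return best[1]
```

Real motif models are usually position weight matrices rather than exact strings, so a fuller version would replace the substring test with a scoring function and a cutoff.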

------------------------------------------
MINING GENETIC POLYMORPHISMS FOR PATTERNS IN HUMAN DISEASES

Project Members

Project Description

In this project, we use two data mining tools, CBA and WEKA, to search for genetic polymorphisms that affect two human diseases: autosomal recessive spinal muscular atrophy (SMA) and Leber hereditary optic neuropathy (LHON), a bilateral optic neuropathy. We modified WEKA so that it can mine genetic data. Our results provide a number of association rules that identify important genetic variants and phenotypic factors affecting disease state and severity.

------------------------------------------
ruiz@cs.wpi.edu