KDDRG - Association Rule Mining

Computer Science Department

Knowledge Discovery and Data Mining Research Group
KDDRG

Research Projects on
Novel Association Rule Mining Algorithms and Tools

GENERAL DESCRIPTION

Association rules identify collections of data attributes that are statistically related in the underlying data. An association rule has the form X => Y, where X and Y are disjoint conjunctions of attribute-value pairs. The confidence of the rule is the conditional probability of Y given X, Pr(Y|X), and the support of the rule is the joint probability of X and Y, Pr(X and Y). Here probability is taken to be the observed frequency in the data set. The traditional association rule mining problem can be stated as follows: given a database of transactions, a minimum confidence threshold, and a minimum support threshold, find all association rules whose confidence and support meet the corresponding thresholds. In our research, we have extended this traditional framework to better fit new application domains, in particular genetic analysis and electronic commerce.
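
The two measures defined above can be computed directly from observed frequencies. The following is a minimal sketch, not code from any of the projects described here: each transaction is modeled as a set of items (an attribute-value pair could be encoded as a string such as "color=red"), and the function names and the toy data are assumptions made for illustration.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(itemset: FrozenSet[str], transactions: List[Transaction]) -> float:
    """Observed frequency of transactions that contain every item in itemset."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x: FrozenSet[str], y: FrozenSet[str],
               transactions: List[Transaction]) -> float:
    """Observed Pr(Y|X): support of X and Y together divided by support of X."""
    support_x = support(x, transactions)
    if support_x == 0:
        return 0.0
    return support(x | y, transactions) / support_x


# Toy data, invented for this example.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

x = frozenset({"bread"})
y = frozenset({"milk"})
rule_support = support(x | y, transactions)       # 2 of 4 transactions
rule_confidence = confidence(x, y, transactions)  # 2 of the 3 with bread
```

For the rule bread => milk, two of the four transactions contain both items, and two of the three transactions containing bread also contain milk.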

ADAPTIVE-SUPPORT ASSOCIATION RULE MINING

Project Members

Faculty: Carolina Ruiz and Sergio A. Alvarez.
Students: Weiyang Lin.

Project Description

In this research we propose a new approach for mining association rules of classification type particularly suited for use in the electronic commerce area of collaborative recommender systems. Such systems rely on information about relationships between different users' preferences in order to recommend items of potential interest to the target user. Despite their successful application to other domains, existing association rule mining techniques are not suitable for the recommendation domain because they mine many rules that are not relevant to a given user. Also, they require that the minimum support (also known as the significance) of the mined rules be specified in advance, often leading to too many or too few rules. In contrast, our approach adjusts the minimum support so that the number of rules obtained is within a specified range, thus avoiding excessive computation time while guaranteeing that enough rules are provided to allow good classification performance. Our experimental results show that the rules mined by our approach allow excellent recommendation performance.
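
The idea of adjusting the minimum support until the number of rules falls within a specified range can be sketched as follows. This is an illustration under stated assumptions, not the project's algorithm: the description above does not say how the adjustment is carried out, so the sketch uses plain bisection, which works because raising the minimum support can never increase the number of rules. The brute-force miner, the function names, and the toy data are all hypothetical.

```python
from itertools import combinations


def mine_rules(transactions, target, min_support, min_confidence, max_body=2):
    """Brute-force search for rules body => target (illustration only)."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t} - {target})
    rules = []
    for size in range(1, max_body + 1):
        for body in combinations(items, size):
            body_set = frozenset(body)
            body_count = sum(1 for t in transactions if body_set <= t)
            if body_count == 0:
                continue
            both_count = sum(
                1 for t in transactions if body_set <= t and target in t
            )
            sup = both_count / n
            conf = both_count / body_count
            if sup >= min_support and conf >= min_confidence:
                rules.append((body_set, target, sup, conf))
    return rules


def adaptive_support_rules(transactions, target, min_confidence,
                           min_rules, max_rules, max_iters=20):
    """Bisect on the minimum support until the rule count is in range.

    If no threshold yields a count in [min_rules, max_rules], the result of
    the last attempt is returned, which may lie outside the range.
    """
    lo, hi = 0.0, 1.0
    threshold, rules = hi, []
    for _ in range(max_iters):
        threshold = (lo + hi) / 2
        rules = mine_rules(transactions, target, threshold, min_confidence)
        if len(rules) > max_rules:
            lo = threshold   # too many rules: raise the minimum support
        elif len(rules) < min_rules:
            hi = threshold   # too few rules: lower the minimum support
        else:
            break
    return threshold, rules


# Toy preference data, invented for this example: each set lists the items
# one user likes, and we look for rules that predict "target_item".
transactions = [
    frozenset({"item_a", "item_b", "target_item"}),
    frozenset({"item_a", "target_item"}),
    frozenset({"item_a", "item_b"}),
    frozenset({"item_b", "target_item"}),
    frozenset({"item_a", "item_b", "target_item"}),
    frozenset({"item_c"}),
]

threshold, rules = adaptive_support_rules(
    transactions, "target_item", min_confidence=0.6, min_rules=3, max_rules=5
)
```

Restricting the search to rules whose consequent is a single target item reflects the classification-type rules mentioned above; a real miner would replace the brute-force enumeration.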

MINING ASSOCIATION RULES FROM SET-VALUED DATA

Project Members

Faculty: Carolina Ruiz.
Students: Chris Shoemaker.

Project Description

This project presents an association rule mining system that is capable of handling set-valued attributes. Our previous research exposed us to a variety of real-world biological datasets that contain attributes whose values are sets of elements rather than individual elements. However, very few data mining tools accept datasets that contain set-valued attributes, and none of them allows association rules to be mined directly from this type of data. This motivated us to develop a system that can discover association rules from such data. We have outlined various techniques for transforming set-valued attributes into normal attributes, and have described the conditions under which each of these transformations is or is not suitable. We have implemented a system that accepts set-valued data directly and discovers association rules, and even classification rules, from it. Our system works by automating the best of these transformations and applying the Apriori algorithm, the standard algorithm in the community for mining association rules. Our system makes it much easier to create input files containing set-valued data and makes it possible to mine association rules directly from that data.
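
One way to transform a set-valued attribute into normal attributes is to treat each element of the set as its own item. The sketch below illustrates that single transformation; the description above does not name the transformations the system uses, so this choice, the function name `expand_set_valued`, and the sample records are assumptions made for illustration.

```python
def expand_set_valued(records, set_attrs):
    """Convert records into transactions of "attribute=value" items.

    Each element of a set-valued attribute becomes a separate item, so a
    standard itemset miner such as Apriori can process the result.
    """
    transactions = []
    for record in records:
        items = set()
        for attr, value in record.items():
            if attr in set_attrs:
                # One item per element of the set.
                items.update(f"{attr}={element}" for element in value)
            else:
                items.add(f"{attr}={value}")
        transactions.append(frozenset(items))
    return transactions


# Hypothetical records: "motifs" is set-valued, "organism" is single-valued.
records = [
    {"organism": "yeast", "motifs": {"m1", "m2"}},
    {"organism": "yeast", "motifs": {"m2"}},
]

transactions = expand_set_valued(records, set_attrs={"motifs"})
```

The resulting transactions have the same shape as ordinary market-basket data, which is why an unmodified Apriori implementation can be applied afterwards.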

MERGING THE ASSOCIATION RULE MINING MODULES OF THE WEKA AND ARMINER DATA MINING SYSTEMS

Project Members

Faculty: Carolina Ruiz.
Students: Zack Stoecker-Sylvia.

Project Description

This project describes the construction of a new association rule mining module for the WEKA data mining system. The new module is created by merging WEKA's existing association rule mining module with the rule mining portion of another system, ARMiner. The data representation of each system is examined, and the features of the two systems are evaluated for inclusion in the new system. The resulting system is described and its performance is tested.

MINING ASSOCIATION RULES FROM TIME SEQUENCE-VALUED DATA

Project Members

Faculty: Carolina Ruiz.
Students: Keith A. Pray.

Project Description

This research project aims to introduce time sequence attributes and to develop new algorithms for finding knowledge in complicated temporal data sets. Such data sets arise in domains such as computer system performance measurement, stock market analysis, and complex system diagnostics. In our representation, a single instance can contain attributes whose values are sequences ordered along a shared time line, in addition to regular single-valued attributes. This makes it possible to find temporal associations that represent far more complicated behavior than currently available data mining systems can identify. Algorithms that can mine this data in a viable fashion could greatly improve the volume and quality of the knowledge gained, and could save large amounts of the manual analysis time now spent in these domains. Time sequence attributes introduce some challenges to mining association rules but also offer the possibility of great rewards. These challenges include:

Extending existing data representation method(s) used in data mining systems.
Extending existing data mining systems and algorithms to handle this data natively without reduction or summarization.
Mining patterns over datasets that contain both time sequence attributes and normal (i.e. relational) attributes.
Quantifying the order of complexity of the knowledge represented using these attributes.
Developing algorithms that make mining data of this complexity computationally feasible.

This project's aim is to develop algorithms for mining these data and to define methods for quantifying the computational cost and the gains in knowledge that result.
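
The kind of instance described in this section, combining single-valued attributes with sequences ordered along a shared time line, might be represented as in the sketch below. This is an assumption-laden illustration rather than the project's data format: it interprets "shared time line" as sequences of equal length aligned by index, and the class name, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Instance:
    """One instance with single-valued and time sequence attributes."""
    scalars: Dict[str, Union[str, float]]
    sequences: Dict[str, List[float]]

    def __post_init__(self):
        # Assumption: sequences share a time line when they have the same
        # length, so position k refers to the same time step in each one.
        lengths = {len(seq) for seq in self.sequences.values()}
        if len(lengths) > 1:
            raise ValueError("all sequences must share the same time line")

    def timeline_length(self) -> int:
        """Number of time steps, or 0 if there are no sequence attributes."""
        return len(next(iter(self.sequences.values()), []))


# Hypothetical performance-measurement instance.
inst = Instance(
    scalars={"host": "server-a"},
    sequences={
        "cpu_load": [0.2, 0.9, 0.4],
        "memory_used": [0.5, 0.6, 0.8],
    },
)
```

Keeping the sequences intact inside the instance, rather than reducing them to summary statistics, matches the stated goal of handling this data natively.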