DUE DATE: Thursday Nov 19, 2009.
- Slides: Submit by email by 1:00 pm.
- Written report: Hand in a hardcopy by 2:00 pm.
- Oral Presentation: during class that day.
This assignment consists of two parts:
- A homework part in which you will focus on the construction and/or
pruning of the models.
- A project part in which you will focus on the experimental evaluation
and analysis of the models.
I. Homework Part
[100 points]
In this part of the assignment, you will become familiar with the details of
the Apriori algorithm for mining association rules.
Consider the dataset below, adapted from
the Soybean dataset that comes with the Weka system.
@relation 'soybean-subset'
@attribute precip {lt-norm,norm,gt-norm}
@attribute temp {lt-norm,norm,gt-norm}
@attribute hail {yes,no}
@attribute plant-growth {norm,abnorm}
@data
gt-norm, norm, yes, abnorm
lt-norm, gt-norm, yes, abnorm
lt-norm, norm, no, abnorm
lt-norm, norm, yes, abnorm
lt-norm, gt-norm, yes, abnorm
lt-norm, gt-norm, no, abnorm
gt-norm, lt-norm, yes, abnorm
lt-norm, norm, yes, norm
lt-norm, lt-norm, yes, abnorm
lt-norm, gt-norm, yes, norm
lt-norm, norm, yes, norm
lt-norm, norm, no, norm
norm, lt-norm, no, norm
lt-norm, norm, yes, norm
Faithfully follow the Apriori algorithm with minimum support = 15%
(that is, minimum support count = 2 data instances)
and minimum confidence = 90%. [Note that the dataset above contains
repeated instances. Consider them as different transactions containing
the same items. Hence, each of the repeated transactions/instances
contributes towards the support of the itemsets it contains.]
- [70 points]
Generate all the frequent itemsets by hand, level by level.
Do it exactly as the Apriori algorithm would.
When constructing level k+1 from level k,
use the join condition to generate only those candidate
itemsets that are potentially frequent, and
use the prune condition to remove those candidate itemsets
that won't be frequent because at least one of their subsets is
not frequent. Mark with an "X" those itemsets removed by
the prune condition, and don't count their support in
the dataset.
SHOW ALL THE DETAILS OF YOUR WORK.
(A small Python sketch of the join, prune, and confidence computations
appears after these two tasks; you may use it to double-check your hand work.)
- [30 points]
In this part, you will generate association rules with minimum
confidence 90%. To save time,
you don't have to generate all association rules from all the
frequent itemsets. Instead, select the largest itemset (i.e.,
the itemset with the most items) that you
generated in the previous part of this problem, and use it to
generate all the association rules that can be produced from it
(i.e., association rules with 2, 3, 4, ... items).
For each such rule, calculate its confidence (show the
details), and mark
those rules that have confidence greater than or equal to 90%.
SHOW ALL THE DETAILS OF YOUR WORK.
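The sketch referred to above is given here, in Python (our choice of language;
the assignment itself requires no programming). It is only an aid for checking
your hand work, not a substitute for showing it. All helper names
(support_count, frequent_levels, rules_from) are ours; each value is encoded as
an "attribute=value" item, and the join step is a simplified one (union of two
frequent k-itemsets that differ in one item) rather than the textbook
prefix-based join, so the list of pruned candidates it prints may differ
slightly from yours, but the frequent itemsets at each level should be the same.

from itertools import combinations

# Dataset above, with each value encoded as an "attribute=value" item so that
# values shared across attributes (e.g., "norm") remain distinguishable.
ATTRS = ["precip", "temp", "hail", "plant-growth"]
ROWS = [
    ("gt-norm", "norm", "yes", "abnorm"),
    ("lt-norm", "gt-norm", "yes", "abnorm"),
    ("lt-norm", "norm", "no", "abnorm"),
    ("lt-norm", "norm", "yes", "abnorm"),
    ("lt-norm", "gt-norm", "yes", "abnorm"),
    ("lt-norm", "gt-norm", "no", "abnorm"),
    ("gt-norm", "lt-norm", "yes", "abnorm"),
    ("lt-norm", "norm", "yes", "norm"),
    ("lt-norm", "lt-norm", "yes", "abnorm"),
    ("lt-norm", "gt-norm", "yes", "norm"),
    ("lt-norm", "norm", "yes", "norm"),
    ("lt-norm", "norm", "no", "norm"),
    ("norm", "lt-norm", "no", "norm"),
    ("lt-norm", "norm", "yes", "norm"),
]
TRANSACTIONS = [frozenset(f"{a}={v}" for a, v in zip(ATTRS, r)) for r in ROWS]
MIN_COUNT = 2    # minimum support count, as stated in the problem
MIN_CONF = 0.9   # minimum confidence

def support_count(itemset):
    return sum(itemset <= t for t in TRANSACTIONS)

def frequent_levels():
    """Frequent itemsets level by level: simplified join, then prune, then count."""
    items = sorted({i for t in TRANSACTIONS for i in t})
    level = [frozenset([i]) for i in items if support_count(frozenset([i])) >= MIN_COUNT]
    levels, k = [], 1
    while level:
        levels.append(level)
        # Join: union of two frequent k-itemsets that differ in exactly one item.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        # Prune: drop candidates having an infrequent k-subset (the ones marked "X"
        # by hand), without ever counting their support in the dataset.
        survivors = [c for c in candidates
                     if all(frozenset(s) in level for s in combinations(c, k))]
        level = [c for c in sorted(survivors, key=sorted) if support_count(c) >= MIN_COUNT]
        k += 1
    return levels

def rules_from(itemset):
    """Every rule A -> B with A union B = itemset, together with its confidence."""
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            a = frozenset(antecedent)
            yield a, itemset - a, support_count(itemset) / support_count(a)

levels = frequent_levels()
for k, level in enumerate(levels, start=1):
    print(f"Level {k}: {len(level)} frequent itemsets")
largest = max(levels[-1], key=support_count)   # one of the largest frequent itemsets
for a, b, conf in rules_from(largest):
    mark = "  <-- confidence >= 90%" if conf >= MIN_CONF else ""
    print(sorted(a), "->", sorted(b), f"confidence = {conf:.2f}{mark}")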
II. Project Part
- Project Instructions:
Thoroughly read and follow the
Project Guidelines.
These guidelines contain detailed information about how to structure your
project, and how to prepare your written and oral reports.
- Data Mining Technique(s):
We will run experiments using the Apriori association rule mining algorithm.
- Dataset(s):
In this project, we will use the following dataset:
- The Mushroom Data Set
Make sure to rename the attribute values so that the resulting association
rules are easy to understand.
- Performance Metric(s):
Support and confidence of the rules.
Experiment also with different "metricTypes"
(i.e., lift, leverage, and conviction). Include in
your report a definition/description of each of these metrics
(their formulas are illustrated in the sketch below).
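These metrics can all be computed from the support counts of the antecedent A,
the consequent B, and the rule as a whole. The Python sketch below shows one
way to compute them; the function name rule_metrics and the counts passed to it
are purely illustrative, not part of Weka.

def rule_metrics(n, count_a, count_b, count_ab):
    """Metrics of a rule A -> B mined from n transactions.

    count_a  -- transactions containing the antecedent A
    count_b  -- transactions containing the consequent B
    count_ab -- transactions containing both A and B
    """
    supp_a, supp_b, supp_ab = count_a / n, count_b / n, count_ab / n
    confidence = supp_ab / supp_a
    lift = confidence / supp_b                 # > 1: A and B co-occur more often than if independent
    leverage = supp_ab - supp_a * supp_b       # observed joint support minus expected joint support
    conviction = (1 - supp_b) / (1 - confidence) if confidence < 1 else float("inf")
    return {"support": supp_ab, "confidence": confidence,
            "lift": lift, "leverage": leverage, "conviction": conviction}

# Hypothetical counts, for illustration only:
print(rule_metrics(n=8124, count_a=3376, count_b=4208, count_ab=3184))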
- General Comments:
In contrast with our previous classification and regression projects,
we won't use any evaluation protocol (e.g., 10-fold cross-validation)
for the association analysis in this project, as we're not using the
rules for prediction.
Focus instead on experimenting with different ways of preprocessing
the data, varying the parameters of the Apriori algorithm, and
providing your own method to evaluate the resulting collections of
association rules (one possible starting point is sketched below).
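As one purely illustrative starting point for such an evaluation method (not a
required approach, and not part of Weka), the Python helper below summarizes a
collection of mined rules by their confidence and lift; the function name,
threshold, and sample rules are all our own invention.

def summarize_rules(rules, lift_threshold=1.2):
    """Summarize a collection of rules, each a dict with 'confidence' and 'lift'."""
    n = len(rules)
    return {
        "number_of_rules": n,
        "avg_confidence": sum(r["confidence"] for r in rules) / n,
        "avg_lift": sum(r["lift"] for r in rules) / n,
        "rules_with_lift_at_least_threshold": sum(r["lift"] >= lift_threshold for r in rules),
    }

# Hypothetical rules, e.g., parsed from the output of one Apriori run:
rules = [{"confidence": 0.97, "lift": 1.61}, {"confidence": 0.91, "lift": 1.05}]
print(summarize_rules(rules))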