
DUE DATE: Thursday March 13, 2008.
- Slides: Submit by email by 2:30 pm.
- Written report: Hand in a hardcopy by 3:30 pm.
- Oral Presentation: during class that day.

This assignment consists of two parts:
- A homework part in which you will focus on the construction and/or
pruning of the models.
- A project part in which you will focus on the experimental evaluation
and analysis of the models.
I. Homework Part
[100 points]
In this part of the assignment, you will get familiar with the details of the
Apriori algorithm for mining association rules.
Consider the dataset below, adapted from:
G. F. Luger and W. A. Stubblefield, "Artificial Intelligence: Structures
and Strategies for Complex Problem Solving," Third Edition, Addison-Wesley, 1998.
@attribute credit_history {bad,unknown,good}
@attribute debt {low,high}
@attribute collateral {none,adequate}
@attribute income {0-15,15-35,>35}
credit_history (CH) | debt (D) | collateral (CO) | income (I)
--------------------+----------+-----------------+-----------
bad                 | low      | none            | 0-15
unknown             | high     | none            | 15-35
unknown             | low      | none            | 15-35
bad                 | low      | none            | 0-15
unknown             | low      | adequate        | >35
unknown             | low      | none            | >35
unknown             | high     | none            | 0-15
bad                 | low      | adequate        | >35
good                | low      | none            | >35
good                | high     | adequate        | >35
good                | high     | none            | 0-15
good                | high     | none            | 15-35
good                | high     | none            | >35
bad                 | high     | none            | 15-35
Faithfully follow the Apriori algorithm with minimum support = 14%
(that is, minimum support count = 2 data instances)
and minimum confidence = 90% to complete the tasks below. [Note that the
dataset above contains repeated instances. Treat them as distinct
transactions containing the same items; hence, each repeated
transaction/instance contributes towards the support of the itemsets it contains.]
- [70 points]
Generate all the frequent itemsets by hand, level by level,
exactly as the Apriori algorithm would.
When constructing level k+1 from level k,
use the join condition to generate only those candidate
itemsets that are potentially frequent, and
use the prune condition to remove those candidate itemsets
that cannot be frequent because at least one of their subsets is
not frequent. Mark with an "X" the itemsets removed by
the prune condition, and do not count their support in
the dataset.
SHOW ALL THE DETAILS OF YOUR WORK.
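If you want to verify your hand trace afterwards, the level-wise join-and-prune procedure can be sketched in Python. This is only an illustrative sketch, not a substitute for the required hand work; it encodes each attribute=value pair as an item, which is one reasonable convention for this tabular data:

```python
from itertools import combinations

# The 14 transactions from the table above; each attribute=value pair is
# treated as one item (attribute prefixes keep values like ">35" unambiguous).
transactions = [
    {"CH=bad", "D=low", "CO=none", "I=0-15"},
    {"CH=unknown", "D=high", "CO=none", "I=15-35"},
    {"CH=unknown", "D=low", "CO=none", "I=15-35"},
    {"CH=bad", "D=low", "CO=none", "I=0-15"},
    {"CH=unknown", "D=low", "CO=adequate", "I=>35"},
    {"CH=unknown", "D=low", "CO=none", "I=>35"},
    {"CH=unknown", "D=high", "CO=none", "I=0-15"},
    {"CH=bad", "D=low", "CO=adequate", "I=>35"},
    {"CH=good", "D=low", "CO=none", "I=>35"},
    {"CH=good", "D=high", "CO=adequate", "I=>35"},
    {"CH=good", "D=high", "CO=none", "I=0-15"},
    {"CH=good", "D=high", "CO=none", "I=15-35"},
    {"CH=good", "D=high", "CO=none", "I=>35"},
    {"CH=bad", "D=high", "CO=none", "I=15-35"},
]

MIN_COUNT = 2  # minimum support count = 2 instances (14% of 14)

def support_count(itemset):
    """Number of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions)

def apriori():
    # Level 1: frequent single items.
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support_count(frozenset([i])) >= MIN_COUNT]
    frequent = list(level)
    k = 1
    while level:
        level_set = set(level)
        # Join condition: merge two frequent k-itemsets sharing k-1 items,
        # generating only potentially frequent (k+1)-candidates.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == k + 1}
        # Prune condition: drop any candidate with an infrequent k-subset
        # (the "X" itemsets in the hand trace) BEFORE counting support.
        survivors = [c for c in candidates
                     if all(frozenset(s) in level_set
                            for s in combinations(c, k))]
        level = [c for c in survivors if support_count(c) >= MIN_COUNT]
        frequent.extend(level)
        k += 1
    return frequent
```

Note that pruned candidates never reach the support-counting step, matching the instruction not to count their support in the dataset.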
- [30 points]
In this part, you will generate association rules with minimum
confidence 90%. To save time,
you do not have to generate association rules from all the
frequent itemsets. Instead, select the largest itemset (i.e.,
the itemset with the most items) that you
generated in the previous part of this problem, and use it to
generate all the association rules that can be produced from it
(i.e., association rules with 2, 3, 4, ...
items). For each such rule, calculate its confidence (show the
details), and mark
those rules whose confidence is greater than or equal to 90%.
SHOW ALL THE DETAILS OF YOUR WORK.
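The rule-generation step for a single frequent itemset can likewise be sketched. This is a brute-force check that tries every nonempty proper subset as an antecedent and computes confidence = support(whole itemset) / support(antecedent); the full Apriori rule generator additionally prunes by confidence, which this sketch omits for clarity:

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions, min_conf=0.9):
    """Enumerate all rules LHS -> RHS from one itemset with their confidence.

    Returns (lhs, rhs, confidence, meets_min_conf) tuples.
    """
    def count(s):
        # Support count: transactions containing all items of s.
        return sum(s <= t for t in transactions)

    whole = count(itemset)
    rules = []
    for r in range(1, len(itemset)):          # every nonempty proper subset
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            conf = whole / count(lhs)          # conf(LHS -> RHS)
            rules.append((lhs, itemset - lhs, conf, conf >= min_conf))
    return rules
```

For example, on the toy transactions [{a,b,c}, {a,b}, {a,c}], the itemset {a,b} yields the rule a -> b with confidence 2/3 and b -> a with confidence 1.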
II. Project Part
[200 points: 100 points for association rules per dataset.
See the
Project Guidelines
for the detailed distribution of these points.]
- Project Instructions:
Thoroughly read and follow the
Project Guidelines.
These guidelines contain detailed information about how to structure your
project, and how to prepare your written and oral reports.
- Data Mining Technique(s):
We will run experiments using the Apriori association rule mining algorithm
on each dataset.
- Dataset(s):
In this project, we will use two datasets:
- The
1995 Data Analysis Exposition.
This dataset contains college data taken from the U.S. News & World Report's Guide to
America's Best Colleges. The necessary files are:
- A dataset that you choose depending on your own interests.
It may be a dataset you are working with for your research or your job.
It should contain enough instances (at least 200 instances) and
several attributes (at least 10). Ideally it should contain a good mix of
numeric and nominal attributes.
I include below some links to Data Repositories containing
multiple datasets to choose from:
- Performance Metric(s):
Support and confidence of the rules.
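For reference, both metrics can be computed directly once transactions are represented as sets of items (a minimal sketch; the rule X -> Y is passed as its two item sets):

```python
def support(x, y, transactions):
    # support(X -> Y) = fraction of transactions containing X ∪ Y.
    return sum((x | y) <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    # confidence(X -> Y) = count(X ∪ Y) / count(X).
    return (sum((x | y) <= t for t in transactions)
            / sum(x <= t for t in transactions))
```

Support measures how often the rule applies in the whole dataset, while confidence measures how often it holds when its antecedent is present.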
- General Comments
In contrast with our previous classification and regression projects,
we won't use any evaluation protocol (e.g., 10-fold cross-validation)
for the association analysis in this project, as we're not using the
rules for prediction.
Focus instead on experimenting with different ways of preprocessing
the data, varying the parameters of the Apriori algorithm, and
providing your own method to evaluate the resulting collections of
association rules.