WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - A Term 2008 
Homework and Project 2: Data Pre-processing, Mining, and Evaluation of Rules

PROF. CAROLINA RUIZ 

DUE DATE:
The individual homework assignment is due on Friday, Sept. 19 2008 at 1:00 pm, and
The individual+group project is due on Friday, Sept. 19 2008 at 12:00 noon. 

------------------------------------------


HOMEWORK AND PROJECT OBJECTIVES

The purpose of this project is multi-fold: to gain experience with data pre-processing, and with mining and evaluating classification and association rules, both by hand and on real datasets.

Readings: Read Sections 4.1, 4.4, 4.5, and 6.2 from your textbook in great detail.

INDIVIDUAL HOMEWORK ASSIGNMENT

Consider the following dataset, adapted from the
Iris dataset available at the University of California, Irvine (UCI) Machine Learning Repository (also available in the Weka data directory).
ATTRIBUTES:	POSSIBLE VALUES:

sepallength 	{sl-short,sl-med,sl-long}
petallength 	{pl-short,pl-med,pl-long}
petalwidth 	{pw-short,pw-med,pw-long}
class 		{Iris-setosa,Iris-versicolor,Iris-virginica}
sepallength	petallength	petalwidth	class
sl-short	pl-short	pw-short	Iris-setosa
sl-short	pl-short	pw-short	Iris-setosa
sl-short	pl-short	pw-short	Iris-setosa
sl-long		pl-med		pw-med		Iris-versicolor
sl-long		pl-long		pw-med		Iris-versicolor
sl-med		pl-med		pw-med		Iris-versicolor
sl-med		pl-med		pw-med		Iris-versicolor
sl-med		pl-long		pw-med		Iris-virginica
sl-med		pl-long		pw-long		Iris-virginica
sl-long		pl-long		pw-long		Iris-virginica

  1. (50 points) Classification Rules:
    See Solutions by Piotr Mardziel and Amro Khasawneh.

    Construct "by hand" all the perfect classification rules that the Prism algorithm would output for this dataset, using the ratio p/t to rank the attribute-value pairs that are candidates for inclusion in a rule. Your written solutions should show all your work: that is, list all the attribute-value pairs that were candidates at each stage of the rule-construction process, and indicate which ones were selected.
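For reference, the covering process described above can be sketched in a few lines of Python. This is illustrative only and does not replace the required hand trace; it assumes ties on p/t are broken by the larger p (state your own tie-breaking rule in your written solution).

```python
# Minimal sketch of the Prism covering algorithm with the p/t ratio,
# run on the 10-instance dataset above. Illustrative only.

DATA = [
    ("sl-short", "pl-short", "pw-short", "Iris-setosa"),
    ("sl-short", "pl-short", "pw-short", "Iris-setosa"),
    ("sl-short", "pl-short", "pw-short", "Iris-setosa"),
    ("sl-long",  "pl-med",   "pw-med",   "Iris-versicolor"),
    ("sl-long",  "pl-long",  "pw-med",   "Iris-versicolor"),
    ("sl-med",   "pl-med",   "pw-med",   "Iris-versicolor"),
    ("sl-med",   "pl-med",   "pw-med",   "Iris-versicolor"),
    ("sl-med",   "pl-long",  "pw-med",   "Iris-virginica"),
    ("sl-med",   "pl-long",  "pw-long",  "Iris-virginica"),
    ("sl-long",  "pl-long",  "pw-long",  "Iris-virginica"),
]
ATTRS = ["sepallength", "petallength", "petalwidth"]

def prism(data, attrs):
    """Return a list of (conditions, class) perfect rules."""
    rules = []
    for cls in sorted({row[-1] for row in data}):
        remaining = list(data)                 # fresh copy per class
        while any(row[-1] == cls for row in remaining):
            conds = {}                         # attribute index -> value test
            covered = remaining
            # grow the rule until it covers only instances of `cls`
            while any(row[-1] != cls for row in covered) and len(conds) < len(attrs):
                best = None                    # (p/t, p, attr index, value)
                for i in range(len(attrs)):
                    if i in conds:
                        continue
                    for val in {row[i] for row in covered}:
                        matched = [row for row in covered if row[i] == val]
                        t = len(matched)
                        p = sum(row[-1] == cls for row in matched)
                        # rank by p/t; assumed tie-break: larger p
                        if best is None or (p / t, p) > best[:2]:
                            best = (p / t, p, i, val)
                conds[best[2]] = best[3]
                covered = [row for row in covered if row[best[2]] == best[3]]
            rules.append((conds, cls))
            # remove the instances covered by the finished rule
            remaining = [row for row in remaining
                         if not all(row[i] == v for i, v in conds.items())]
    return rules

for conds, cls in prism(DATA, ATTRS):
    tests = " AND ".join(f"{ATTRS[i]} = {v}" for i, v in sorted(conds.items()))
    print(f"IF {tests} THEN class = {cls}")
```

On this dataset every rule the sketch produces is perfect, which is what your hand trace should confirm step by step.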

  2. (50 points) Association Rules:
    See Solutions by Piotr Mardziel and Amro Khasawneh.

    Mine association rules by hand from this dataset by faithfully following the Apriori algorithm, with minimum support = 35% (since the dataset contains 10 instances, the minimum support count is 3 instances) and minimum confidence = 90%. Note that you need to produce regular association rules, not classification association rules.

    1. (35 points) Generate all the frequent itemsets by hand, level by level. Do it exactly as the Apriori algorithm would. When constructing level k+1 from level k, use the join condition to generate only those candidate itemsets that are potentially frequent, and use the prune condition to remove those candidate itemsets that won't be frequent because at least one of their subsets is not frequent. Mark with an "X" those itemsets removed by the prune condition, and don't count their support in the dataset. SHOW ALL THE DETAILS OF YOUR WORK.
    2. (15 points) In this part, you will generate association rules with minimum confidence 90%. To save time, you don't have to generate all association rules from all the frequent itemsets. Instead, select the largest itemset (i.e., the itemset with the most items) that you generated in the previous part of this problem, and use it to generate all association rules that can be produced from it using all the items in the itemset (i.e., if the itemset contains n items, consider only rules that include all n items). For each such rule, calculate its confidence (show the details), and mark those rules whose confidence is greater than or equal to 90%. SHOW ALL THE DETAILS OF YOUR WORK.
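The level-wise generation (join + prune) and the confidence computation described above can be sketched as follows. This is illustrative only and does not replace the hand trace; it treats each attribute-value and the class label as an item, with the assignment's thresholds (minimum support count 3, minimum confidence 90%).

```python
# Minimal sketch of level-wise Apriori (join + prune) and of
# rule-confidence calculation on the 10-instance dataset above.
from itertools import combinations

TRANSACTIONS = [frozenset(t) for t in [
    {"sl-short", "pl-short", "pw-short", "Iris-setosa"},
    {"sl-short", "pl-short", "pw-short", "Iris-setosa"},
    {"sl-short", "pl-short", "pw-short", "Iris-setosa"},
    {"sl-long", "pl-med", "pw-med", "Iris-versicolor"},
    {"sl-long", "pl-long", "pw-med", "Iris-versicolor"},
    {"sl-med", "pl-med", "pw-med", "Iris-versicolor"},
    {"sl-med", "pl-med", "pw-med", "Iris-versicolor"},
    {"sl-med", "pl-long", "pw-med", "Iris-virginica"},
    {"sl-med", "pl-long", "pw-long", "Iris-virginica"},
    {"sl-long", "pl-long", "pw-long", "Iris-virginica"},
]]

def apriori(transactions, min_count):
    """Return {frequent itemset: support count}, built level by level."""
    support = {}
    level = []
    for item in sorted({i for t in transactions for i in t}):
        count = sum(1 for t in transactions if item in t)
        if count >= min_count:
            s = frozenset([item])
            support[s] = count
            level.append(s)
    k = 1
    while level:
        # join: union two frequent k-itemsets that share k-1 items
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == k + 1}
        next_level = []
        for cand in candidates:
            # prune: skip (and don't count) candidates with an
            # infrequent k-subset -- the itemsets marked "X" by hand
            if not all(frozenset(s) in support for s in combinations(cand, k)):
                continue
            count = sum(1 for t in transactions if cand <= t)
            if count >= min_count:
                support[cand] = count
                next_level.append(cand)
        level, k = next_level, k + 1
    return support

support = apriori(TRANSACTIONS, min_count=3)
largest = max(support, key=len)      # the itemset with the most items
for r in range(1, len(largest)):     # rules using all items of `largest`
    for lhs in map(frozenset, combinations(sorted(largest), r)):
        # conf(L => R) = sup(L u R) / sup(L)
        conf = support[largest] / support[lhs]
        mark = "**" if conf >= 0.9 else "  "
        print(f"{mark} {sorted(lhs)} => {sorted(largest - lhs)} "
              f"conf = {conf:.0%}")
```

Your hand trace should agree with the sketch at every level: the same candidates after the join, the same prunes, and the same support counts and confidences.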

INDIVIDUAL + GROUP PROJECT ASSIGNMENT
[800 points: 100 points per data mining technique per dataset per individual/group parts. See
Project Guidelines for the detailed distribution of these points]