WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - A Term 2008 
Homework and Project 2: Data Pre-processing, Mining, and Evaluation of Rules

PROF. CAROLINA RUIZ 

DUE DATE:
The individual homework assignment is due on Friday, Sept. 19 2008 at 1:00 pm, and
The individual+group project is due on Friday, Sept. 19 2008 at 12:00 noon. 

------------------------------------------


HOMEWORK AND PROJECT OBJECTIVES

The purpose of this project is multi-fold: to gain experience with data pre-processing, and with mining and evaluating classification and association rules, both by hand and on real datasets.

Readings: Read Sections 4.1, 4.4, 4.5, and 6.2 from your textbook in great detail.

INDIVIDUAL HOMEWORK ASSIGNMENT

Consider the following dataset, adapted from the
Iris dataset available at the University of California, Irvine (UCI) Machine Learning Repository (also available in the Weka data directory).
ATTRIBUTES:	POSSIBLE VALUES:

sepallength 	{sl-short,sl-med,sl-long}
petallength 	{pl-short,pl-med,pl-long}
petalwidth 	{pw-short,pw-med,pw-long}
class 		{Iris-setosa,Iris-versicolor,Iris-virginica}
sepallength	petallength	petalwidth	class
sl-short	pl-short	pw-short	Iris-setosa
sl-short	pl-short	pw-short	Iris-setosa
sl-short	pl-short	pw-short	Iris-setosa
sl-long		pl-med		pw-med		Iris-versicolor
sl-long		pl-long		pw-med		Iris-versicolor
sl-med		pl-med		pw-med		Iris-versicolor
sl-med		pl-med		pw-med		Iris-versicolor
sl-med		pl-long		pw-med		Iris-virginica
sl-med		pl-long		pw-long		Iris-virginica
sl-long		pl-long		pw-long		Iris-virginica

  1. (50 points) Classification Rules:
    See Solutions by Piotr Mardziel and Amro Khasawneh.

    Construct "by hand" all the perfect classification rules that the Prism algorithm would output for this dataset, using the ratio p/t to rank the attribute-value pairs that are candidates for inclusion in a rule. Your written solutions should show all your work: that is, list all the attribute-value pairs that were candidates at each stage of the rule-construction process, and indicate which ones were selected.
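For reference, the covering process described above can be sketched in a few lines of Python. This is illustrative only and does not replace the required hand trace; it assumes ties on p/t are broken by the larger p (state your own tie-breaking rule in your written solution).

```python
# Minimal sketch of the Prism covering algorithm with the p/t ratio,
# run on the 10-instance dataset above. Illustrative only.

DATA = [
    ("sl-short", "pl-short", "pw-short", "Iris-setosa"),
    ("sl-short", "pl-short", "pw-short", "Iris-setosa"),
    ("sl-short", "pl-short", "pw-short", "Iris-setosa"),
    ("sl-long",  "pl-med",   "pw-med",   "Iris-versicolor"),
    ("sl-long",  "pl-long",  "pw-med",   "Iris-versicolor"),
    ("sl-med",   "pl-med",   "pw-med",   "Iris-versicolor"),
    ("sl-med",   "pl-med",   "pw-med",   "Iris-versicolor"),
    ("sl-med",   "pl-long",  "pw-med",   "Iris-virginica"),
    ("sl-med",   "pl-long",  "pw-long",  "Iris-virginica"),
    ("sl-long",  "pl-long",  "pw-long",  "Iris-virginica"),
]
ATTRS = ["sepallength", "petallength", "petalwidth"]

def prism(data, attrs):
    """Return a list of (conditions, class) perfect rules."""
    rules = []
    for cls in sorted({row[-1] for row in data}):
        remaining = list(data)                 # fresh copy per class
        while any(row[-1] == cls for row in remaining):
            conds = {}                         # attribute index -> value test
            covered = remaining
            # grow the rule until it covers only instances of `cls`
            while any(row[-1] != cls for row in covered) and len(conds) < len(attrs):
                best = None                    # (p/t, p, attr index, value)
                for i in range(len(attrs)):
                    if i in conds:
                        continue
                    for val in {row[i] for row in covered}:
                        matched = [row for row in covered if row[i] == val]
                        t = len(matched)
                        p = sum(row[-1] == cls for row in matched)
                        # rank by p/t; assumed tie-break: larger p
                        if best is None or (p / t, p) > best[:2]:
                            best = (p / t, p, i, val)
                conds[best[2]] = best[3]
                covered = [row for row in covered if row[best[2]] == best[3]]
            rules.append((conds, cls))
            # remove the instances covered by the finished rule
            remaining = [row for row in remaining
                         if not all(row[i] == v for i, v in conds.items())]
    return rules

for conds, cls in prism(DATA, ATTRS):
    tests = " AND ".join(f"{ATTRS[i]} = {v}" for i, v in sorted(conds.items()))
    print(f"IF {tests} THEN class = {cls}")
```

On this dataset every rule the sketch produces is perfect, which is what your hand trace should confirm step by step.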

  2. (50 points) Association Rules:
    See Solutions by Piotr Mardziel and Amro Khasawneh.

    Mine association rules by hand from this dataset by faithfully following the Apriori algorithm, with minimum support = 35% (since the dataset contains 10 instances, the minimum support count is 3 instances) and minimum confidence = 90%. Note that you need to produce regular association rules, not classification association rules.

    1. (35 points) Generate all the frequent itemsets by hand, level by level. Do it exactly as the Apriori algorithm would. When constructing level k+1 from level k, use the join condition to generate only those candidate itemsets that are potentially frequent, and use the prune condition to remove those candidate itemsets that won't be frequent because at least one of their subsets is not frequent. Mark with an "X" those itemsets removed by the prune condition, and don't count their support in the dataset. SHOW ALL THE DETAILS OF YOUR WORK.
    2. (15 points) In this part, you will generate association rules with minimum confidence 90%. To save time, you don't have to generate all association rules from all the frequent itemsets. Instead, select the largest itemset (i.e., the itemset with the most items) that you generated in the previous part of this problem, and use it to generate all association rules that can be produced from it using all the items in the itemset (i.e., if the itemset contains n items, consider only rules that include all n items). For each such rule, calculate its confidence (show the details), and mark those rules whose confidence is greater than or equal to 90%. SHOW ALL THE DETAILS OF YOUR WORK.
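The level-wise generation (join + prune) and the confidence computation described above can be sketched as follows. This is illustrative only and does not replace the hand trace; it treats each attribute-value and the class label as an item, with the assignment's thresholds (minimum support count 3, minimum confidence 90%).

```python
# Minimal sketch of level-wise Apriori (join + prune) and of
# rule-confidence calculation on the 10-instance dataset above.
from itertools import combinations

TRANSACTIONS = [frozenset(t) for t in [
    {"sl-short", "pl-short", "pw-short", "Iris-setosa"},
    {"sl-short", "pl-short", "pw-short", "Iris-setosa"},
    {"sl-short", "pl-short", "pw-short", "Iris-setosa"},
    {"sl-long", "pl-med", "pw-med", "Iris-versicolor"},
    {"sl-long", "pl-long", "pw-med", "Iris-versicolor"},
    {"sl-med", "pl-med", "pw-med", "Iris-versicolor"},
    {"sl-med", "pl-med", "pw-med", "Iris-versicolor"},
    {"sl-med", "pl-long", "pw-med", "Iris-virginica"},
    {"sl-med", "pl-long", "pw-long", "Iris-virginica"},
    {"sl-long", "pl-long", "pw-long", "Iris-virginica"},
]]

def apriori(transactions, min_count):
    """Return {frequent itemset: support count}, built level by level."""
    support = {}
    level = []
    for item in sorted({i for t in transactions for i in t}):
        count = sum(1 for t in transactions if item in t)
        if count >= min_count:
            s = frozenset([item])
            support[s] = count
            level.append(s)
    k = 1
    while level:
        # join: union two frequent k-itemsets that share k-1 items
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == k + 1}
        next_level = []
        for cand in candidates:
            # prune: skip (and don't count) candidates with an
            # infrequent k-subset -- the itemsets marked "X" by hand
            if not all(frozenset(s) in support for s in combinations(cand, k)):
                continue
            count = sum(1 for t in transactions if cand <= t)
            if count >= min_count:
                support[cand] = count
                next_level.append(cand)
        level, k = next_level, k + 1
    return support

support = apriori(TRANSACTIONS, min_count=3)
largest = max(support, key=len)      # the itemset with the most items
for r in range(1, len(largest)):     # rules using all items of `largest`
    for lhs in map(frozenset, combinations(sorted(largest), r)):
        # conf(L => R) = sup(L u R) / sup(L)
        conf = support[largest] / support[lhs]
        mark = "**" if conf >= 0.9 else "  "
        print(f"{mark} {sorted(lhs)} => {sorted(largest - lhs)} "
              f"conf = {conf:.0%}")
```

Your hand trace should agree with the sketch at every level: the same candidates after the join, the same prunes, and the same support counts and confidences.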

INDIVIDUAL + GROUP PROJECT ASSIGNMENT
[800 points: 100 points per data mining technique per dataset per individual/group parts. See
Project Guidelines for the detailed distribution of these points]