WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2006 
Homework and Project 2: Data Pre-processing, Mining, and Evaluation of Rules

PROF. CAROLINA RUIZ 

DUE DATE:
Part I (the individual homework assignment) is due on Tuesday, November 14 2006 at 12:00 noon, and
Parts II.1 and II.2 (the individual+group project) are due on Friday, November 17 2006 at 12:00 noon. 

------------------------------------------


HOMEWORK AND PROJECT DESCRIPTION

The purpose of this project is multi-fold: to gain experience with data pre-processing, with mining classification and association rules, and with evaluating the resulting rules.

Readings: Read Sections 4.1, 4.4, 4.5, and 6.2 from your textbook in great detail.

INDIVIDUAL HOMEWORK ASSIGNMENT

See the solutions to this homework by Piotr Mardziel.

Consider the following dataset, adapted from the Car Evaluation Dataset available at the University of California Irvine (UCI) Machine Learning Data Repository.

ATTRIBUTES:         POSSIBLE VALUES:
buying-price        {vhigh, high, med, low}
maintenance         {vhigh, high, med, low}
persons             {2, 4, more}             % Assumed to be a nominal attribute
safety              {low, med, high}
recommendation      {unacc, acc, good}

buying-price   maintenance   persons   safety   recommendation
high           med           4         high     good
low            med           2         med      unacc
low            high          2         high     unacc
low            vhigh         more      med      acc
med            vhigh         4         med      acc
vhigh          vhigh         4         med      unacc
med            med           more      med      acc
med            vhigh         more      low      unacc
med            low           4         med      acc
high           med           4         low      unacc
low            med           4         high     good
low            low           2         high     unacc

  1. (50 points) Construct "by hand" all the perfect classification rules that the Prism algorithm would output for this dataset, using the ratio p/t to rank the attribute-value pairs that are candidates for inclusion in a rule. Your written solutions should show all your work; that is, list all the attribute-value pairs that were candidates during each stage of the rule-construction process and indicate which one was selected. (A sketch of the PRISM selection loop appears after this list.)

  2. (50 points) Mine association rules by hand from this dataset by faithfully following the Apriori algorithm with minimum support = 25% (since the dataset contains 12 instances, the minimum support count is 3 instances) and minimum confidence = 90%. That is, start by generating candidate itemsets and frequent itemsets level by level, and after all frequent itemsets have been generated, produce from them all the rules with confidence greater than or equal to the minimum confidence. SHOW IN DETAIL ALL THE STEPS OF THE PROCESS. (A sketch of the Apriori level-wise search appears after this list.)
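
To help check the hand construction in item 1, the sketch below illustrates the PRISM covering strategy with the p/t selection criterion (p = instances covered by the candidate rule that belong to the target class, t = all instances it covers; ties are broken in favor of the larger p). It is a minimal illustration written from the textbook's description of the algorithm, not Weka's actual Prism code, and representing each instance as an attribute-to-value dictionary is an assumption made only for this sketch.

    # Minimal sketch of PRISM (textbook description, not Weka's code).
    # Instances are dicts mapping attribute name -> value; the rules returned
    # are (conditions, class_value) pairs, where conditions is a dict of tests.

    def prism(instances, attributes, class_attr):
        rules = []
        for cls in {inst[class_attr] for inst in instances}:
            remaining = list(instances)                 # fresh working set per class
            # keep building rules while uncovered instances of this class remain
            while any(inst[class_attr] == cls for inst in remaining):
                covered = list(remaining)               # an empty rule covers everything
                conditions = {}
                # grow the rule until it is perfect or no attributes are left
                while len(conditions) < len(attributes):
                    best, best_ratio, best_p = None, -1.0, -1
                    for attr in attributes:
                        if attr in conditions:
                            continue
                        for val in {inst[attr] for inst in covered}:
                            matches = [i for i in covered if i[attr] == val]
                            t = len(matches)            # total instances covered by the test
                            p = sum(1 for i in matches if i[class_attr] == cls)
                            # prefer the higher p/t ratio; break ties by the higher p
                            if t and (p / t, p) > (best_ratio, best_p):
                                best, best_ratio, best_p = (attr, val), p / t, p
                    if best is None:
                        break
                    conditions[best[0]] = best[1]
                    covered = [i for i in covered if i[best[0]] == best[1]]
                    if best_ratio == 1.0:               # the rule is now perfect
                        break
                rules.append((dict(conditions), cls))
                # remove the instances covered by the new rule
                remaining = [i for i in remaining
                             if not all(i[a] == v for a, v in conditions.items())]
        return rules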
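
Item 2 can be cross-checked against the following sketch of the Apriori level-by-level search and the subsequent rule generation, using the assignment's thresholds (support count >= 3 out of 12 instances, confidence >= 90%). Representing each data row as a set of (attribute, value) items is an assumption made only for this illustration; this is not Weka's Apriori implementation.

    # Minimal sketch of Apriori: level-wise frequent-itemset generation,
    # then rule generation from the frequent itemsets.
    from itertools import combinations

    def apriori(transactions, min_count=3, min_conf=0.9):
        # transactions: list of sets of (attribute, value) items
        def support(itemset):
            return sum(1 for t in transactions if itemset <= t)

        # Level 1: frequent single items
        items = {i for t in transactions for i in t}
        frequent = [{frozenset([i]) for i in items
                     if support(frozenset([i])) >= min_count}]

        # Level k: join frequent (k-1)-itemsets, prune, then count support
        k = 2
        while frequent[-1]:
            prev = frequent[-1]
            candidates = {a | b for a in prev for b in prev if len(a | b) == k}
            # prune candidates that have an infrequent (k-1)-subset
            candidates = {c for c in candidates
                          if all(frozenset(s) in prev
                                 for s in combinations(c, k - 1))}
            frequent.append({c for c in candidates if support(c) >= min_count})
            k += 1

        # Rule generation: split every frequent itemset into LHS -> RHS and
        # keep the rules whose confidence reaches the threshold
        rules = []
        for level in frequent:
            for itemset in (s for s in level if len(s) >= 2):
                for r in range(1, len(itemset)):
                    for lhs in map(frozenset, combinations(itemset, r)):
                        rhs = itemset - lhs
                        conf = support(itemset) / support(lhs)
                        if conf >= min_conf:
                            rules.append((set(lhs), set(rhs), support(itemset), conf))
        return frequent, rules

    # Example transaction built from the first data row above:
    # {("buying-price", "high"), ("maintenance", "med"), ("persons", "4"),
    #  ("safety", "high"), ("recommendation", "good")}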

Submission and Due Date.

Part I is due Tuesday, Nov. 14th at 12:00 noon. Bring a hardcopy of your homework solutions to class.

PROJECT ASSIGNMENT

The following are general guidelines for the project.

Datasets:

Together with your project partner, choose two datasets from the following three options:

  1. The Titanic Dataset. Look at the dataset description and the data instances.

    I suggest you use the following nominal values for the attributes, rather than 0s and 1s, to make the association rules easier to read (a small recoding sketch appears after this list of dataset options):

    Class (0 = crew, 1 = first, 2 = second, 3 = third)
    Age   (1 = adult, 0 = child)
    Sex   (1 = male, 0 = female)
    Survived (1 = yes, 0 = no)
    
    The "Survived" attribute is the class/target attribute of the Titanic Data.

  2. 1995 Data Analysis Exposition. This dataset contains college data taken from the U.S. News & World Report's Guide to America's Best Colleges. The necessary files are linked from the dataset description. Let's make "private/public" the classification target. Note that even though the values of this attribute are 0s and 1s, it is a nominal (not a numeric!) attribute.

  3. A dataset of your choice. This dataset can be one available on a public, online data repository (including but not limited to the datasets used in Project 1) or from any other valid source. The dataset should contain at least 500 data instances and at least 5 different attributes (ideally some numeric and some nominal).
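
As suggested for the Titanic data above, and likewise for the 0/1 "private/public" attribute of the colleges data, recoding numeric-looking codes as readable nominal labels before mining keeps the resulting rules easy to read and ensures the attributes are treated as nominal rather than numeric. The sketch below shows one possible way to do the recoding; the file names and the assumption that the data is in CSV form with the column headers shown are hypothetical placeholders to adapt to the files you actually download.

    # Hypothetical recoding sketch: map the Titanic 0/1 codes to nominal labels.
    # File names and column headers are placeholders, not the actual files.
    import csv

    LABELS = {
        "Class":    {"0": "crew", "1": "first", "2": "second", "3": "third"},
        "Age":      {"1": "adult", "0": "child"},
        "Sex":      {"1": "male", "0": "female"},
        "Survived": {"1": "yes", "0": "no"},
    }

    with open("titanic_raw.csv", newline="") as src, \
         open("titanic_nominal.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            writer.writerow({col: LABELS.get(col, {}).get(val, val)
                             for col, val in row.items()})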

Experiments:

For each of the two datasets, use the Weka system to perform the operations described in the grading criteria below: pre-process the data, mine classification rules with Prism (evaluating them with different evaluation methods), and mine association rules with Apriori (varying its input parameters).

PROJECT SUBMISSION AND DUE DATE

Part II is due Friday, Nov. 17 at 12:00 noon. BRING A HARDCOPY OF THE INDIVIDUAL AND GROUP WRITTEN REPORTS WITH YOU TO CLASS. In addition, you must submit your report electronically as specified below. Submissions received on Friday, Nov. 17th between 12:01 pm and 12:00 midnight will be penalized with 30% off the grade; submissions received on Saturday, Nov. 18th between 12:01 am (early morning) and 8:00 am will be penalized with 60% off the grade; and submissions received after Saturday, Nov. 18th at 8:00 am won't be accepted.

Please submit the following files using the myWpi digital drop box:

  1. [lastname]_proj2_report.[ext] containing your individual written report. This file should be either a PDF file (ext=pdf), a Word file (ext=doc), or a PostScript file (ext=ps). For instance, my file would be named ruiz_proj2_report.pdf (note the use of lower-case letters only).

    If you are taking this course for graduate credit, state this fact at the beginning of your report. In that case, submit only an individual report containing both the "individual" and the "group" parts, since you are working by yourself on the projects.

  2. [lastname1_lastname2]_proj2_report.[ext] containing your group written report. This file should be either a PDF file (ext=pdf), a Word file (ext=doc), or a PostScript file (ext=ps). As with the individual report, use lower-case letters only in the file name.

  3. [lastname1_lastname2]_proj2_slides.[ext] (or [lastname]_proj2_slides.[ext] in the case of students taking this course for graduate credit) containing your slides for your oral reports. This file should be either a PDF file (ext=pdf) or a PowerPoint file (ext=ppt). Your group will have only 4 minutes in class to discuss the entire project (both individual and group parts, and classification and association rules).

GRADING CRITERIA

INDIVIDUAL

(TOTAL 15 points) Class presentation - how well your oral presentation concisely summarized the results of the project, and how focused it was on the most creative/interesting/useful of your experiments and results. This grade is given individually to each team member.

Classification Rules

(TOTAL: 15 points for individual work) ALGORITHMIC DESCRIPTION OF THE CODE
  (05 points) Description of the algorithm underlying the Weka filters used
  (15 points) Description of the algorithm underlying the construction and pruning of classification rules in Weka's PRISM code
  (up to 5 extra credit points for an outstanding job)
  Providing just a structural description of the code, that is, a list of classes and methods, will receive 0 points.

(TOTAL: 15 points for individual work) PRE-PROCESSING OF THE DATASET
  (05 points) Discretizing attributes as needed
  (05 points) Dealing with missing values appropriately
  (05 points) Dealing with attributes appropriately (i.e., using nominal values instead of numeric when appropriate, using as many of them as possible, etc.)
  (up to 5 extra credit points) Trying to do "fancier" things with attributes (i.e., combining two highly correlated attributes into one, using background knowledge, etc.)

(TOTAL: 72 points for individual work) EXPERIMENTS
  FIRST DATASET (36 points)
    (12 points) Ran a good number of experiments to get familiar with the PRISM classification method and different evaluation methods (%split, cross-validation, ...)
    (08 points) Good description of the experiment settings and the results
    (12 points) Good analysis of the results of the experiments
    (04 points) Discussion of weaknesses and/or strengths of the Prism algorithm and its application to the dataset
    (up to 5 extra credit points) Excellent analysis of the results and comparisons
    (up to 10 extra credit points) Running additional interesting experiments selecting classification attributes other than those required in this project statement ("private/public", "Survived")
  SECOND DATASET (36 points)
    (12 points) Ran a good number of experiments to get familiar with the PRISM classification method and different evaluation methods (%split, cross-validation, ...)
    (08 points) Good description of the experiment settings and the results
    (12 points) Good analysis of the results of the experiments
    (04 points) Discussion of weaknesses and/or strengths of the Prism algorithm and its application to the dataset
    (up to 5 extra credit points) Excellent analysis of the results and comparisons
    (up to 10 extra credit points) Running additional interesting experiments selecting classification attributes other than those required in this project statement ("private/public", "Survived")

Association Rules

(TOTAL: 15 points for individual work) ALGORITHMIC DESCRIPTION OF THE CODE
  (05 points) Description of the algorithm underlying the Weka filters used
  (10 points) Description of the Apriori algorithm for the construction of frequent itemsets and association rules
  (up to 5 extra credit points for an outstanding job)
  Providing just a structural description of the code, that is, a list of classes and methods, will receive 0 points.

(TOTAL: 15 points for individual work) PRE-PROCESSING OF THE DATASET
  (05 points) Discretizing attributes as needed
  (05 points) Dealing with missing values appropriately
  (05 points) Dealing with attributes appropriately (i.e., using nominal values instead of numeric when appropriate, using as many of them as possible, etc.)
  (up to 5 extra credit points) Trying to do "fancier" things with attributes (i.e., combining two highly correlated attributes into one, using background knowledge, etc.)

(TOTAL: 65 points for individual work) EXPERIMENTS
  FIRST DATASET (33 points)
    (10 points) Ran a good number of experiments to get familiar with the Apriori algorithm, varying the input parameters
    (05 points) Good description of the experiment settings and the results
    (13 points) Good analysis of the results of the experiments, INCLUDING discussion of particularly interesting association rules obtained
    (05 points) Comparison of the association rules obtained by Apriori and the classification rules obtained by Prism in Project 2; argumentation of weaknesses and/or strengths of each method on this dataset, and of which method should be preferred for this dataset and why
    (up to 5 extra credit points) Excellent analysis of the results and comparisons
    (up to 10 extra credit points) Running additional interesting experiments
  SECOND DATASET (33 points)
    (10 points) Ran a good number of experiments to get familiar with the Apriori algorithm, varying the input parameters
    (05 points) Good description of the experiment settings and the results
    (13 points) Good analysis of the results of the experiments, INCLUDING discussion of particularly interesting association rules obtained
    (05 points) Comparison of the association rules obtained by Apriori and the classification rules obtained by Prism in Project 2; argumentation of weaknesses and/or strengths of each method on this dataset, and of which method should be preferred for this dataset and why
    (up to 5 extra credit points) Excellent analysis of the results and comparisons
    (up to 10 extra credit points) Running additional interesting experiments

JOINT

(TOTAL 10 points) SLIDES - how well do they concisely summarize the results of the project? We suggest you summarize the settings of your experiments and their results in tabular form.

Classification Rules

(TOTAL: 10 points for group work) PRE-PROCESSING OF THE DATASET
  (05 points) Discretizing attributes as needed
  (05 points) Dealing with missing values appropriately
  (up to 5 extra credit points) Trying to do "fancier" things with attributes (i.e., combining two highly correlated attributes into one, using background knowledge, etc.)

(TOTAL: 48 points for group work) EXPERIMENTS
  FIRST DATASET (24 points)
    (08 points) Good description of the experiment settings and the results
    (12 points) Good analysis of the results of the experiments
    (04 points) Discussion of weaknesses and/or strengths of the Prism algorithm and its application to the dataset
    (up to 5 extra credit points) Excellent analysis of the results and comparisons
    (up to 10 extra credit points) Running additional interesting experiments selecting classification attributes other than those required in this project statement ("private/public", "Survived")
  SECOND DATASET (24 points)
    (08 points) Good description of the experiment settings and the results
    (12 points) Good analysis of the results of the experiments
    (04 points) Discussion of weaknesses and/or strengths of the Prism algorithm and its application to the dataset
    (up to 5 extra credit points) Excellent analysis of the results and comparisons
    (up to 10 extra credit points) Running additional interesting experiments selecting classification attributes other than those required in this project statement ("private/public", "Survived")

(TOTAL: 30 points for group work) CODE MODIFICATION
  (10 points) Description of the algorithmic modification
  (20 points) Description of the modifications made to the Prism code
  (up to 10 extra credit points for an outstanding job)

Association Rules

(TOTAL: 10 points for group work) PRE-PROCESSING OF THE DATASET
  (05 points) Discretizing attributes as needed
  (05 points) Dealing with missing values appropriately
  (up to 5 extra credit points) Trying to do "fancier" things with attributes (i.e., combining two highly correlated attributes into one, using background knowledge, etc.)

(TOTAL: 45 points for group work) EXPERIMENTS
  FIRST DATASET (23 points)
    (05 points) Good description of the experiment settings and the results
    (13 points) Good analysis of the results of the experiments, INCLUDING discussion of particularly interesting association rules obtained
    (05 points) Comparison of the association rules obtained by Apriori and the classification rules obtained by Prism in Project 2; argumentation of weaknesses and/or strengths of each method on this dataset, and of which method should be preferred for this dataset and why
    (up to 5 extra credit points) Excellent analysis of the results and comparisons
    (up to 10 extra credit points) Running additional interesting experiments
  SECOND DATASET (23 points)
    (05 points) Good description of the experiment settings and the results
    (13 points) Good analysis of the results of the experiments, INCLUDING discussion of particularly interesting association rules obtained
    (05 points) Comparison of the association rules obtained by Apriori and the classification rules obtained by Prism in Project 2; argumentation of weaknesses and/or strengths of each method on this dataset, and of which method should be preferred for this dataset and why
    (up to 5 extra credit points) Excellent analysis of the results and comparisons
    (up to 10 extra credit points) Running additional interesting experiments

(TOTAL: 35 points for group work) CODE MODIFICATION
  (10 points) Description of the algorithmic modification
  (20 points) Description of the modifications made to the Apriori code
  (up to 10 extra credit points for an outstanding job)