WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 525D KNOWLEDGE DISCOVERY AND DATA MINING - Spring 2008  
Project 2: Classification

PROF. CAROLINA RUIZ 

DUE DATE: Tuesday February 26, 2008.
------------------------------------------

This assignment consists of two parts:
  1. A homework part in which you will focus on the construction and/or pruning of the models.
  2. A project part in which you will focus on the experimental evaluation and analysis of the models.

I. Homework Part

[100 points] In this part of the assignment, we will investigate the subtree replacement post-pruning approach used by J4.8. For this, consider the Contact Lenses dataset that comes with the Weka system.
  1. [10 points] Run J4.8 over the Contact Lenses dataset using the following parameters: unpruned=True, minNumObj=1, numFolds=2, and default values for the remaining parameters. Include the resulting tree in your report. Note that this tree is the same tree that ID3 outputs on this dataset.

  2. [90 points total] Run J4.8 over the Contact Lenses dataset using the following parameters: unpruned=False, minNumObj=1, numFolds=2, and default values for the remaining parameters. Include the resulting tree in your report [10 points]. Note that this tree is a pruned version of the tree you included in the first part. Your job is to follow by hand the subtree replacement post-pruning method used by J4.8. For this, you need to look at the code in detail. In your written report, include each step of the process: state explicitly which subtree is under consideration at each stage, and calculate the values of the formulas used to decide whether or not to prune that subtree [80 points]. Show your work.
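
As a rough aid for checking hand calculations, the following Python sketch computes the textbook upper confidence limit on a node's error rate (normal approximation to the binomial) and uses it to compare a subtree against the single leaf that would replace it. It is not a transcription of the Weka code: the function names are hypothetical, the `confidence` argument is assumed to play the role of J4.8's confidenceFactor, and the actual implementation may handle edge cases (very small error counts, continuity corrections, tolerance in the comparison) differently. Any number it produces should be verified against the code itself. The class counts in the example are made up for illustration and are not taken from the Contact Lenses dataset.

```python
import math
from statistics import NormalDist


def pessimistic_error_rate(errors, n, confidence=0.25):
    """Upper confidence limit on the true error rate of a node.

    Textbook normal-approximation formula; NOT copied from Weka.
    errors:     training instances misclassified at the node
    n:          training instances reaching the node
    confidence: assumed to correspond to J4.8's confidenceFactor
    """
    if n <= 0:
        raise ValueError("n must be positive")
    f = errors / n                            # observed error rate
    z = NormalDist().inv_cdf(1 - confidence)  # about 0.67 for 0.25
    numerator = (f + z * z / (2 * n)
                 + z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n)))
    return numerator / (1 + z * z / n)


def estimated_errors(class_counts, confidence=0.25):
    """Estimated errors if the node were a leaf predicting its majority class."""
    n = sum(class_counts)
    errors = n - max(class_counts)
    return n * pessimistic_error_rate(errors, n, confidence)


def should_replace(leaf_class_counts, confidence=0.25):
    """True if collapsing the subtree into one leaf does not raise the estimate.

    leaf_class_counts: one list of per-class counts for each leaf of the
    subtree; leaves that no training instance reaches are skipped.
    """
    subtree = sum(estimated_errors(counts, confidence)
                  for counts in leaf_class_counts if sum(counts) > 0)
    merged = [sum(column) for column in zip(*leaf_class_counts)]
    return estimated_errors(merged, confidence) <= subtree


if __name__ == "__main__":
    # Hypothetical subtree with three leaves and two classes.
    leaves = [[4, 2], [1, 1], [4, 2]]
    print("subtree estimate:", sum(estimated_errors(c) for c in leaves))
    print("leaf estimate:   ", estimated_errors([9, 5]))
    print("replace subtree? ", should_replace(leaves))
```

The subtree estimate here is the sum of the leaf estimates, and the subtree is replaced only when a single leaf built from the merged class counts has an estimate no larger than that sum. This covers only the simple case in which all children of the node are leaves; for deeper subtrees the same comparison is applied bottom-up.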

II. Project Part

[400 points: 100 points per data mining technique per dataset. See Project Guidelines for the detailed distribution of these points]