CS 525D Spring 2008

Computer Science Department

CS 525D KNOWLEDGE DISCOVERY AND DATA MINING - Spring 2008
Project 2: Classification

PROF. CAROLINA RUIZ

DUE DATE: Tuesday February 26, 2008.

Slides: Submit by email by 2:30 pm.
Written report: Hand in a hardcopy by 3:30 pm.
Oral Presentation: during class that day.

This assignment consists of two parts:

A homework part in which you will focus on the construction and/or pruning of the models.
A project part in which you will focus on the experimental evaluation and analysis of the models.

I. Homework Part

[100 points] In this part of the assignment, we will investigate the subtree replacement post-pruning approach used by J4.8. For this, consider the Contact Lenses dataset that comes with the Weka system.

[10 points] Run J4.8 over the Contact Lenses dataset using the following parameters: unpruned=True, minNumObj=1, numFolds=2, and default values for the remaining parameters. Include the resulting tree in your report. Note that this tree is the same tree that ID3 outputs on this dataset.
[90 points total] Run J4.8 over the Contact Lenses dataset using the following parameters: unpruned=False, minNumObj=1, numFolds=2, and default values for the remaining parameters. Include the resulting tree in your report [10 points]. Note that this tree is a pruned version of the tree you included in the first part. Your job is to follow by hand the subtree replacement post-pruning method used by J4.8. For this, you need to look at the code in detail. Include in your written report each step of the process, explicitly describing what subtree is under consideration at each stage of the process, and calculating the values from the formulas to decide whether or not to prune a subtree [80 points]. Show your work.
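As a rough illustration of the kind of calculation involved, the sketch below uses the textbook normal-approximation formula for the pessimistic (upper confidence limit) error estimate, with z = 0.69 for the default 25% confidence level. This is an assumption made for illustration only: the Weka source code may compute the estimate differently, so your hand calculations should follow the code itself. The function names and the leaf counts in the example are hypothetical.

```python
import math


def pessimistic_error(errors, n, z=0.69):
    """Upper confidence limit on the error rate of a node.

    errors: training instances misclassified at the node
    n:      training instances reaching the node
    z:      normal deviate for the confidence level
            (0.69 is the textbook value for 25%; an assumption here)
    """
    f = errors / n  # observed error rate
    z2 = z * z
    spread = math.sqrt(f / n - f * f / n + z2 / (4 * n * n))
    return (f + z2 / (2 * n) + z * spread) / (1 + z2 / n)


def should_replace_subtree(node_errors, leaf_counts):
    """Decide whether to replace a subtree by a single leaf.

    node_errors: errors the node would make as a single leaf
    leaf_counts: list of (errors, n) pairs, one per leaf of the subtree
    """
    total = sum(n for _, n in leaf_counts)
    # Estimate for the subtree: leaf estimates weighted by leaf size.
    subtree_estimate = sum(
        n * pessimistic_error(e, n) for e, n in leaf_counts
    ) / total
    # Estimate for the node if it were collapsed into one leaf.
    leaf_estimate = pessimistic_error(node_errors, total)
    return leaf_estimate <= subtree_estimate


# Hypothetical subtree with three leaves covering 14 instances.
leaves = [(2, 6), (1, 2), (2, 6)]
print(should_replace_subtree(5, leaves))
```

In this hypothetical case the collapsed leaf has a lower estimated error than the weighted combination of its leaves, so the subtree would be replaced.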

II. Project Part

[400 points: 100 points per data mining technique per dataset. See Project Guidelines for the detailed distribution of these points]

Project Instructions: Thoroughly read and follow the Project Guidelines. These guidelines contain detailed information about how to structure your project, and how to prepare your written and oral reports.
Data Mining Technique(s): We will run experiments using decision trees. You need to use ID3 and J4.8 on each dataset.
Dataset(s): In this project, we will use two datasets:
- The census-income dataset from the US Census Bureau which is available at the Univ. of California Irvine Repository.
  The census-income dataset contains census information for 48,842 people. It has 14 attributes for each person (age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country) and a boolean class attribute classifying the income of the person as belonging to one of two categories: >50K or <=50K.
- A dataset that you choose depending on your own interests. It may be a dataset you are working with for your research or your job. It should contain enough instances (at least 200 instances) and several attributes (at least 10). Ideally it should contain a good mix of numeric and nominal attributes.
  I include below some links to Data Repositories containing multiple datasets to choose from:
Performance Metric(s): Use classification accuracy to measure the goodness of your models.
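For reference, classification accuracy is the fraction of test instances whose predicted class matches the actual class. A minimal sketch follows; the function name and the sample labels are hypothetical and are not taken from any dataset.

```python
def accuracy(actual, predicted):
    """Fraction of instances whose predicted class equals the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one instance is required")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical labels for four test instances.
actual = [">50K", "<=50K", "<=50K", ">50K"]
predicted = [">50K", "<=50K", ">50K", ">50K"]
print(accuracy(actual, predicted))
```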
