DUE DATE: Thursday, October 22, 2009.
- Slides: Submit by email by 1:00 pm.
- Written report: Hand in a hardcopy by 2:00 pm.
- Oral Presentation: during class that day.
This assignment consists of two parts:
- A homework part in which you will focus on the construction of
the models.
- A project part in which you will focus on the experimental evaluation
and analysis of the models.
I. Homework Part
[20 points]
Calculate Gain(S,A1) and Gain(S,A2) for the dataset S and the attributes A1 and A2
shown on Slide 8 of the slides used in class to describe the ID3 algorithm.
Show each step of the calculation.
Include your solution in your written report (not in your oral report).
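For reference, below is a minimal sketch (in Java; it is not required for the homework)
of the entropy and information-gain arithmetic that ID3 uses. The class counts in main()
are hypothetical and are NOT the values from Slide 8; substitute the counts from the
slide when you do the calculation by hand.

    // Minimal sketch of the entropy and information-gain arithmetic used by ID3.
    // The class counts in main() are hypothetical, for illustration only.
    public class GainSketch {

        // Entropy(S) = - sum_i p_i * log2(p_i), given the class counts of S
        static double entropy(int... classCounts) {
            int total = 0;
            for (int c : classCounts) total += c;
            double e = 0.0;
            for (int c : classCounts) {
                if (c == 0) continue;             // 0 * log2(0) is taken to be 0
                double p = (double) c / total;
                e -= p * (Math.log(p) / Math.log(2));
            }
            return e;
        }

        // Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v),
        // where S_v is the subset of S having value v for attribute A
        static double gain(int[] classCountsOfS, int[][] classCountsPerValue) {
            int total = 0;
            for (int c : classCountsOfS) total += c;
            double expectedEntropy = 0.0;
            for (int[] sv : classCountsPerValue) {
                int svSize = 0;
                for (int c : sv) svSize += c;
                expectedEntropy += ((double) svSize / total) * entropy(sv);
            }
            return entropy(classCountsOfS) - expectedEntropy;
        }

        public static void main(String[] args) {
            // Hypothetical example: S has 9 positive and 5 negative instances, and
            // attribute A partitions S into subsets with class counts {6,2} and {3,3}.
            int[] s = {9, 5};
            int[][] partitions = {{6, 2}, {3, 3}};
            System.out.printf("Entropy(S) = %.4f%n", entropy(s));
            System.out.printf("Gain(S, A) = %.4f%n", gain(s, partitions));
        }
    }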
II. Project Assignment
THOROUGHLY READ AND FOLLOW THE PROJECT GUIDELINES.
These guidelines contain detailed information about how to structure your
project, and how to prepare your written and oral reports.
- Data Mining Technique(s):
We will run experiments using the following decision tree techniques:
- ID3, and
- J4.8 (given that J4.8 is able to handle numeric attributes and missing values
directly, make sure to run some experiments with no pre-processing and some
experiments with pre-processing, and compare your results).
- Dataset(s):
In this project, we will use two datasets:
- The census-income dataset from the US Census Bureau, which is available at the
Univ. of California Irvine Repository.
The census-income dataset contains census information for 48,842 people.
It has 14 attributes for each person (age, workclass, fnlwgt, education,
education-num, marital-status, occupation, relationship, race, sex,
capital-gain, capital-loss, hours-per-week, and native-country) and a
boolean class attribute classifying the income of the person as belonging
to one of two categories: >50K or <=50K.
- A dataset that you choose depending on your own interests.
It may be a dataset you are working with for your research or your job.
It should contain enough instances (at least 200) and several attributes
(at least 10). Ideally, it should contain a good mix of numeric and
nominal attributes.
I include below some links to Data Repositories containing multiple datasets to choose from:
THIS DATASET CANNOT BE ONE OF THOSE INCLUDED IN THE WEKA SYSTEM.
- Performance Metric(s):
- Use (1) classification accuracy, (2) size of the tree, and (3) readability
of the tree, as separate measures to evaluate the "goodness" of your models.
- Compare each accuracy you obtained against those of benchmarking techniques
such as ZeroR and OneR over the same (sub-)set of data instances you used in
the corresponding experiment (see the sketch at the end of this assignment).
- Remember to experiment with pruning of your J4.8 decision tree:
experiment with Weka's J4.8 classifier to see how it performs pre- and/or
post-pruning of the decision tree in order to increase the classification
accuracy and/or to reduce the size of the decision tree.
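Below is a hedged sketch, not part of the official project guidelines, of one way the
accuracy comparisons and the pruned/unpruned J4.8 experiment could be scripted with
Weka's Java API. The file name census-income.arff, the parameter values, and the
10-fold cross-validation setup are placeholder assumptions that you would replace
with your own data and experimental design.

    // Sketch: compare ZeroR, OneR, and unpruned vs. post-pruned J4.8 on one dataset.
    // "census-income.arff" is a placeholder for the ARFF file you prepare.
    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.OneR;
    import weka.classifiers.rules.ZeroR;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ProjectExperimentSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("census-income.arff");
            data.setClassIndex(data.numAttributes() - 1);   // class label is the last attribute

            J48 unpruned = new J48();
            unpruned.setUnpruned(true);                     // no post-pruning

            J48 pruned = new J48();
            pruned.setConfidenceFactor(0.25f);              // default; smaller values prune more

            Classifier[] models = { new ZeroR(), new OneR(), unpruned, pruned };
            String[] names = { "ZeroR", "OneR", "J4.8 unpruned", "J4.8 pruned" };

            for (int i = 0; i < models.length; i++) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(models[i], data, 10, new Random(1));  // 10-fold CV
                System.out.printf("%-14s accuracy = %.2f%%%n", names[i], eval.pctCorrect());
            }

            // Tree size (one of the report's metrics) is available after building a J48 model.
            pruned.buildClassifier(data);
            System.out.println("Pruned tree size = " + pruned.measureTreeSize()
                    + ", leaves = " + pruned.measureNumLeaves());
        }
    }

Using the same random seed for every classifier keeps the cross-validation folds
identical, so the accuracies are compared over the same subsets of data instances,
as required above.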