### CS 525D KNOWLEDGE DISCOVERY AND DATA MINING - Fall 2009 Project 2: Classification

#### PROF. CAROLINA RUIZ

DUE DATE: Thursday, October 22, 2009.
• Slides: Submit by email by 1:00 pm.
• Written report: Hand in a hardcopy by 2:00 pm.
• Oral Presentation: during class that day.

This assignment consists of two parts:
1. A homework part in which you will focus on the construction of the models.
2. A project part in which you will focus on the experimental evaluation and analysis of the models.

### I. Homework Part

[20 points] Calculate Gain(S, A1) and Gain(S, A2) for the dataset S and attributes A1 and A2 on Slide 8 of the slides used in class to describe the ID3 algorithm. Show each step of the calculation. Include your solution in your written report (not in your oral presentation).
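As a sanity check for your hand calculation, the entropy and information-gain formulas can be sketched in Python. The tiny dataset below is only a placeholder, not the dataset S from Slide 8:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum p * log2(p)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(examples, attribute):
    """Information gain of splitting `examples` (dicts with a 'class'
    key) on `attribute`: Entropy(S) - sum_v |Sv|/|S| * Entropy(Sv)."""
    labels = [e["class"] for e in examples]
    n = len(examples)
    remainder = 0.0
    for v in set(e[attribute] for e in examples):
        subset = [e["class"] for e in examples if e[attribute] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Tiny illustrative dataset (NOT the one from Slide 8)
S = [
    {"A1": "t", "class": "+"},
    {"A1": "t", "class": "+"},
    {"A1": "f", "class": "-"},
    {"A1": "f", "class": "+"},
]
print(round(gain(S, "A1"), 3))  # → 0.311
```

Working through it by hand gives the same number: Entropy(S) = -(3/4)log2(3/4) - (1/4)log2(1/4) ≈ 0.811, the "t" branch has entropy 0, the "f" branch entropy 1, so Gain = 0.811 - (2/4)·0 - (2/4)·1 ≈ 0.311.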

### II. Project Assignment

• Data Mining Technique(s): We will run experiments using the following decision tree techniques:
• ID3, and
• J4.8 (given that J4.8 is able to handle numeric attributes and missing values directly, make sure to run some experiments with no pre-processing and some experiments with pre-processing, and compare your results).
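Since ID3 cannot handle numeric attributes or missing values directly, the census-income data will need pre-processing before ID3 can be applied. A minimal sketch of two common steps, equal-width discretization and mode imputation (the function names and bin labels are illustrative; Weka's filters do the equivalent for ARFF files):

```python
def equal_width_bins(values, k):
    """Discretize numeric values into k equal-width bins, returning
    nominal labels 'bin0'..'bin{k-1}' (ID3 needs nominal attributes)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant attribute
    def bin_of(x):
        return f"bin{min(int((x - lo) / width), k - 1)}"
    return [bin_of(x) for x in values]

def fill_missing(values, missing="?"):
    """Replace missing nominal values with the mode of the attribute."""
    present = [v for v in values if v != missing]
    mode = max(set(present), key=present.count)
    return [mode if v == missing else v for v in values]

ages = [17, 25, 38, 52, 90]
print(equal_width_bins(ages, 3))
# → ['bin0', 'bin0', 'bin0', 'bin1', 'bin2']
print(fill_missing(["Private", "?", "Private", "State-gov"]))
# → ['Private', 'Private', 'Private', 'State-gov']
```

Comparing J4.8 run on the raw data against J4.8 (or ID3) run on data discretized this way is one concrete form the "with vs. without pre-processing" experiments can take.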

• Dataset(s): In this project, we will use two datasets:
• The census-income dataset from the US Census Bureau, which is available at the University of California, Irvine (UCI) Repository.
The census-income dataset contains census information for 48,842 people. It has 14 attributes for each person (age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country) and a boolean class attribute classifying the income of each person as belonging to one of two categories: >50K or <=50K.

• A dataset that you choose depending on your own interests. It may be a dataset you are working with for your research or your job. It should contain enough instances (at least 200) and several attributes (at least 10). Ideally it should contain a good mix of numeric and nominal attributes.
I include below some links to Data Repositories containing multiple datasets to choose from:
THIS DATASET CANNOT BE ONE OF THOSE INCLUDED IN THE WEKA SYSTEM.

• Performance Metric(s):
• Use (1) classification accuracy, (2) size of the tree, and (3) readability of the tree, as separate measures to evaluate the "goodness" of your models.
• Compare each accuracy you obtain against those of benchmarking techniques such as ZeroR and OneR over the same (sub-)set of data instances used in the corresponding experiment.
• Remember to experiment with pruning of your J4.8 decision tree: experiment with Weka's J4.8 classifier to see how it performs pre- and/or post-pruning of the decision tree in order to increase classification accuracy and/or reduce the size of the tree.
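For reference, the two baseline classifiers can be sketched in a few lines of Python. The toy training set below is purely illustrative; its attribute names merely echo the census data:

```python
from collections import Counter, defaultdict

def zero_r(train_labels):
    """ZeroR: always predict the majority class of the training set."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda example: majority

def one_r(examples, attributes):
    """OneR: for each attribute, build a rule mapping each value to the
    majority class among training examples with that value; keep the
    attribute whose rule makes the fewest training errors."""
    best_attr, best_rule, best_errors = None, None, float("inf")
    for a in attributes:
        buckets = defaultdict(list)
        for e in examples:
            buckets[e[a]].append(e["class"])
        rule = {v: Counter(labels).most_common(1)[0][0]
                for v, labels in buckets.items()}
        errors = sum(1 for e in examples if rule[e[a]] != e["class"])
        if errors < best_errors:
            best_attr, best_rule, best_errors = a, rule, errors
    return lambda e: best_rule.get(e[best_attr])

# Toy training set (illustrative only)
train = [
    {"sex": "M", "workclass": "Private",   "class": ">50K"},
    {"sex": "M", "workclass": "State-gov", "class": ">50K"},
    {"sex": "F", "workclass": "Private",   "class": "<=50K"},
    {"sex": "F", "workclass": "Private",   "class": "<=50K"},
    {"sex": "F", "workclass": "Private",   "class": "<=50K"},
]
zr = zero_r([e["class"] for e in train])
oner = one_r(train, ["sex", "workclass"])
print(zr(train[0]), oner(train[0]))  # → <=50K >50K
```

Any tree whose accuracy does not clearly beat these two baselines on the same instances is not extracting useful structure from the data, which is the point of the comparison the assignment asks for.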