DUE DATE: Thursday, October 22, 2009.
- Slides: Submit by email by 1:00 pm.
- Written report: Hand in a hardcopy by 2:00 pm.
- Oral Presentation: during class that day.
This assignment consists of two parts:
- A homework part in which you will focus on the construction of
the models.
- A project part in which you will focus on the experimental evaluation
and analysis of the models.
I. Homework Part
[20 points]
Calculate Gain(S,A1) and Gain(S,A2) for the dataset S and the attributes A1 and A2
shown on Slide 8 of the slides used in class to describe the ID3 algorithm.
Show each step of the calculation.
Include your solution in your written report (not in your oral report).
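For reference, below is a minimal sketch (in Java; it is not required for the homework)
of the entropy and information-gain arithmetic that ID3 uses. The class counts in main()
are hypothetical and are NOT the values from Slide 8; substitute the counts from the
slide when you do the calculation by hand.

    // Minimal sketch of the entropy and information-gain arithmetic used by ID3.
    // The class counts in main() are hypothetical, for illustration only.
    public class GainSketch {

        // Entropy(S) = - sum_i p_i * log2(p_i), given the class counts of S
        static double entropy(int... classCounts) {
            int total = 0;
            for (int c : classCounts) total += c;
            double e = 0.0;
            for (int c : classCounts) {
                if (c == 0) continue;             // 0 * log2(0) is taken to be 0
                double p = (double) c / total;
                e -= p * (Math.log(p) / Math.log(2));
            }
            return e;
        }

        // Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v),
        // where S_v is the subset of S having value v for attribute A
        static double gain(int[] classCountsOfS, int[][] classCountsPerValue) {
            int total = 0;
            for (int c : classCountsOfS) total += c;
            double expectedEntropy = 0.0;
            for (int[] sv : classCountsPerValue) {
                int svSize = 0;
                for (int c : sv) svSize += c;
                expectedEntropy += ((double) svSize / total) * entropy(sv);
            }
            return entropy(classCountsOfS) - expectedEntropy;
        }

        public static void main(String[] args) {
            // Hypothetical example: S has 9 positive and 5 negative instances, and
            // attribute A partitions S into subsets with class counts {6,2} and {3,3}.
            int[] s = {9, 5};
            int[][] partitions = {{6, 2}, {3, 3}};
            System.out.printf("Entropy(S) = %.4f%n", entropy(s));
            System.out.printf("Gain(S, A) = %.4f%n", gain(s, partitions));
        }
    }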
II. Project Assignment
THOROUGHLY READ AND FOLLOW THE PROJECT GUIDELINES.
These guidelines contain detailed information about how to structure your
project, and how to prepare your written and oral reports.
- Data Mining Technique(s):
We will run experiments using the following decision tree techniques:
- ID3, and
- J4.8 (given that J4.8 is able to handle numeric attributes and missing values
directly, make sure to run some experiments with no pre-processing and some
experiments with pre-processing, and compare your results).
- Dataset(s):
In this project, we will use two datasets:
- The census-income dataset from the US Census Bureau, which is available at the
Univ. of California Irvine Repository.
The census-income dataset contains census information for 48,842 people.
It has 14 attributes for each person (age, workclass, fnlwgt, education,
education-num, marital-status, occupation, relationship, race, sex,
capital-gain, capital-loss, hours-per-week, and native-country) and a
boolean class attribute classifying the income of the person as belonging
to one of two categories: >50K or <=50K.
- A dataset that you choose depending on your own interests.
It may be a dataset you are working with for your research or your job.
It should contain enough instances (at least 200) and several attributes
(at least 10). Ideally, it should contain a good mix of numeric and
nominal attributes.
I include below some links to Data Repositories containing multiple datasets to choose from:
THIS DATASET CANNOT BE ONE OF THOSE INCLUDED IN THE WEKA SYSTEM.
- Performance Metric(s):
- Use (1) classification accuracy, (2) size of the tree, and (3) readability
of the tree, as separate measures to evaluate the "goodness" of your models.
- Compare each accuracy you obtained against those of benchmarking techniques
such as ZeroR and OneR over the same (sub-)set of data instances you used in
the corresponding experiment (see the sketch at the end of this assignment).
- Remember to experiment with pruning of your J4.8 decision tree:
experiment with Weka's J4.8 classifier to see how it performs pre- and/or
post-pruning of the decision tree in order to increase the classification
accuracy and/or to reduce the size of the decision tree.
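Below is a hedged sketch, not part of the official project guidelines, of one way the
accuracy comparisons and the pruned/unpruned J4.8 experiment could be scripted with
Weka's Java API. The file name census-income.arff, the parameter values, and the
10-fold cross-validation setup are placeholder assumptions that you would replace
with your own data and experimental design.

    // Sketch: compare ZeroR, OneR, and unpruned vs. post-pruned J4.8 on one dataset.
    // "census-income.arff" is a placeholder for the ARFF file you prepare.
    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.OneR;
    import weka.classifiers.rules.ZeroR;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ProjectExperimentSketch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("census-income.arff");
            data.setClassIndex(data.numAttributes() - 1);   // class label is the last attribute

            J48 unpruned = new J48();
            unpruned.setUnpruned(true);                     // no post-pruning

            J48 pruned = new J48();
            pruned.setConfidenceFactor(0.25f);              // default; smaller values prune more

            Classifier[] models = { new ZeroR(), new OneR(), unpruned, pruned };
            String[] names = { "ZeroR", "OneR", "J4.8 unpruned", "J4.8 pruned" };

            for (int i = 0; i < models.length; i++) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(models[i], data, 10, new Random(1));  // 10-fold CV
                System.out.printf("%-14s accuracy = %.2f%%%n", names[i], eval.pctCorrect());
            }

            // Tree size (one of the report's metrics) is available after building a J48 model.
            pruned.buildClassifier(data);
            System.out.println("Pruned tree size = " + pruned.measureTreeSize()
                    + ", leaves = " + pruned.measureNumLeaves());
        }
    }

Using the same random seed for every classifier keeps the cross-validation folds
identical, so the accuracies are compared over the same subsets of data instances,
as required above.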