
DUE DATE: Tuesday February 26, 2008.
- Slides: Submit by email by 2:30 pm.
- Written report: Hand in a hardcopy by 3:30 pm.
- Oral Presentation: during class that day.

This assignment consists of two parts:
- A homework part in which you will focus on the construction and/or
pruning of the models.
- A project part in which you will focus on the experimental evaluation
and analysis of the models.
I. Homework Part
[100 points]
In this part of the assignment, we will investigate the subtree replacement
post-pruning approach used by J4.8. For this, consider the Contact Lenses
dataset that comes with the Weka system.
- [10 points] Run J4.8 over the Contact Lenses dataset using the following
parameters: unpruned=True, minNumObj=1, numFolds=2, and default
values for the remaining parameters.
Include the resulting tree in your report.
Note that this tree is the same as the one ID3 outputs on this dataset.
- [90 points total] Run J4.8 over the Contact Lenses dataset using the following
parameters: unpruned=False, minNumObj=1, numFolds=2, and default
values for the remaining parameters.
Include the resulting tree in your report [10 points].
Note that this tree is a pruned version of the tree you included in
the first part. Your job is to follow by hand the subtree replacement
post-pruning method used by J4.8. For this, you need to look at the
code in detail. Include in your written report each step of the process,
explicitly describing what subtree is under consideration at each
stage of the process, and calculating the values from the formulas
to decide whether or not to prune a subtree [80 points]. Show your work.
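As a starting point for checking your hand calculations, here is a minimal
Python sketch of the textbook pessimistic error estimate commonly used to
describe C4.5-style pruning. It is not the Weka code: the function names and
the leaf counts are hypothetical, and the J4.8 source may differ in details
such as continuity corrections and special cases, so you still need to read
the code and use the formulas you find there.

```python
from math import sqrt
from statistics import NormalDist


def pessimistic_error(errors, n, confidence=0.25):
    """Upper confidence bound on the error rate of a node that
    misclassifies `errors` of the `n` training instances reaching it.
    Textbook formula (assumption: not copied from the Weka source)."""
    z = NormalDist().inv_cdf(1.0 - confidence)  # about 0.67 for 25%
    f = errors / n                              # observed error rate
    numerator = f + z * z / (2 * n) + z * sqrt(
        f / n - f * f / n + z * z / (4 * n * n))
    return numerator / (1 + z * z / n)


def subtree_error(leaves, confidence=0.25):
    """Weighted average of the leaf estimates. `leaves` is a list of
    (errors, n) pairs, one pair per leaf of the subtree."""
    total = sum(n for _, n in leaves)
    return sum(n * pessimistic_error(e, n, confidence)
               for e, n in leaves) / total


# Hypothetical subtree with three leaves (NOT taken from Contact Lenses).
leaves = [(2, 6), (1, 2), (2, 6)]
as_subtree = subtree_error(leaves)
as_leaf = pessimistic_error(5, 14)  # the same node collapsed into one leaf

# Replace the subtree by a leaf when the leaf's estimate is no worse.
prune = as_leaf <= as_subtree
print(f"subtree: {as_subtree:.3f}  leaf: {as_leaf:.3f}  prune: {prune}")
```

With these made-up counts the single leaf has the lower estimate, so the
subtree would be replaced. In your report, carry out the same comparison for
every candidate subtree, using the counts from your own unpruned tree.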
II. Project Part
[400 points: 100 points per data mining technique per dataset.
See Project Guidelines for the detailed distribution of these points]
- Project Instructions:
Thoroughly read and follow the Project Guidelines.
These guidelines contain detailed information about how to structure your
project, and how to prepare your written and oral reports.
- Data Mining Technique(s):
We will run experiments using decision trees.
You need to use ID3 and J4.8 on each dataset.
- Dataset(s):
In this project, we will use two datasets:
- The census-income dataset from the US Census Bureau, which is
available at the Univ. of California Irvine Repository.
The census-income dataset contains census information for 48,842
people. It has 14 attributes for each person
(age, workclass, fnlwgt, education, education-num, marital-status,
occupation, relationship, race, sex, capital-gain, capital-loss,
hours-per-week, and native-country)
and a boolean attribute, class, classifying the income
of the person as belonging to one of two categories: >50K or <=50K.
- A dataset that you choose depending on your own interests.
It may be a dataset you are working with for your research or your job.
It should contain enough instances (at least 200 instances) and
several attributes (at least 10). Ideally it should contain a good mix of
numeric and nominal attributes.
I include below some links to Data Repositories containing
multiple datasets to choose from:
- Performance Metric(s):
Use classification accuracy to measure the goodness of your models.
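Classification accuracy is the fraction of test instances whose predicted
class matches the actual class. The short Python sketch below illustrates the
computation; the function name and the example labels are hypothetical, and
Weka reports the same quantity in its evaluation summary.

```python
def accuracy(actual, predicted):
    """Fraction of instances whose predicted class equals the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one instance")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical labels for four test instances (not real results).
actual = [">50K", "<=50K", "<=50K", ">50K"]
predicted = [">50K", "<=50K", ">50K", ">50K"]
acc = accuracy(actual, predicted)  # 3 of the 4 predictions match
print(f"accuracy = {acc:.2f}")
```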