CS539 Machine Learning
Assignment 2 - Fall 2000
Due:
Thursday, September 21, 2000 at 6:00 pm.
PROJECT DESCRIPTION
Construct the most accurate decision tree you can
for predicting whether the income of a given person is >50K or <= 50K
using the
census-income dataset
from the US Census Bureau which is
available at the
Univ. of California Irvine Repository.
I have downloaded the dataset into the following directory:
/cs/courses/cs539/f00/Projects/Census_Income_Data
You can access the dataset from there.
The census-income dataset contains census information for 48,842
people. It has 14 attributes for each person
(age,
workclass,
fnlwgt,
education,
education-num,
marital-status,
occupation,
relationship,
race,
sex,
capital-gain,
capital-loss,
hours-per-week, and
native-country)
and a boolean class attribute classifying the income
of the person as belonging to one of two categories: >50K or <=50K.
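As a starting point, the record layout described above can be read into a simple data structure. The following Python sketch is only an illustration: it assumes each record is a single comma-separated line with the 14 attributes in the order listed, followed by the class label, and that labels in the test file may end with a period. Check the files in the course directory before relying on either assumption. The sample line is an illustrative record, and parse_record is a hypothetical helper name.

```python
# Attribute names in the order listed in the assignment.
ATTRIBUTES = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
]
CLASS = "class"


def parse_record(line):
    """Turn one comma-separated line into a dict, or None if it is malformed."""
    fields = [field.strip() for field in line.strip().split(",")]
    if len(fields) != len(ATTRIBUTES) + 1:
        return None  # blank line, header line, or truncated record
    record = dict(zip(ATTRIBUTES + [CLASS], fields))
    # Assumption: test-file labels may carry a trailing period (">50K.").
    record[CLASS] = record[CLASS].rstrip(".")
    return record


# Illustrative record in the assumed format.
sample = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
          "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
record = parse_record(sample)
print(record["age"], record["occupation"], record[CLASS])
```

All values are kept as strings here; converting the numeric attributes is left to whatever discretization scheme you choose.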
PROJECT ASSIGNMENT
The following are guidelines for the construction of your decision tree:
- Code: You can use
- the
decision tree learning code from Chapter 3 of the textbook
(CMU Lisp runs on the following CS machines: penguin, toucan or crane. Just
type "lisp" to run it.)
- any other decision tree code you find available, or
- your own code.
Your code must run on the CS or CCC Unix machines.
- Training Instances:
Use the
census-income dataset.
You can restrict your experiments to a subset of the dataset if
your system cannot handle the whole dataset. But remember that the
more accurate your system is, the better.
Also,
note that this dataset has missing values. It is up to you how to fill in
appropriate data for those missing values. It is also up to you
how to discretize continuous attributes.
- Test Instances:
Test data are also available at the UCI repository.
YOU MUST USE AT LEAST THE FIRST 500 TEST RECORDS FROM THAT TEST
DATA IN YOUR EXPERIMENTS.
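The guidelines above leave the treatment of missing values and continuous attributes open. One simple option among many, sketched here in Python, is to replace a missing value with the most common known value of that attribute and to split a continuous attribute into equal-width intervals. The "?" marker for missing values and the helper names fill_missing and equal_width_bins are assumptions made for this illustration; other strategies (dropping incomplete records, entropy-based split points) may well give a more accurate tree.

```python
from collections import Counter

MISSING = "?"  # assumption: marker used for missing values in the data files


def fill_missing(records, attribute):
    """Return copies of the records with missing values of one attribute
    replaced by that attribute's most common known value."""
    known = [r[attribute] for r in records if r[attribute] != MISSING]
    if not known:
        return list(records)  # nothing to infer the replacement from
    mode = Counter(known).most_common(1)[0][0]
    return [{**r, attribute: mode} if r[attribute] == MISSING else r
            for r in records]


def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid dividing by zero when hi == lo
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Small made-up example.
people = [{"workclass": "Private"}, {"workclass": "?"},
          {"workclass": "Private"}, {"workclass": "State-gov"}]
filled = fill_missing(people, "workclass")
age_bins = equal_width_bins([20, 30, 40, 50, 60], n_bins=4)
print(filled[1]["workclass"], age_bins)
```

Whatever scheme you use, apply the same replacement values and bin boundaries to the test records so that the tree sees test data in the same form as the training data.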
REPORT AND DUE DATE
- Written Report.
Please bring your report to my office (FL232) or to class by the due date/time.
Your report should contain the following sections that discuss the issues:
- Code Description:
- Did you write your own code or did you use an existing one?
If you used code provided by someone else:
- provide the reference
- describe any adaptations of the code that you made
- Describe briefly the algorithm that the code you used implements.
- "Control" Experiment
- Use your code to construct a decision tree using:
- the first 1000 instances of the training dataset that
do NOT contain missing values
- the discrete-valued attributes only.
- Include the resulting tree in your report.
- What was the accuracy of that tree when tested over the
first 500 instances of the test data?
- Other Experiments: For each further experiment you ran, describe:
- Training Data: What data did you use to construct your decision tree?
- Test Data: What data did you use to test your decision tree?
- Any pre- or post-processing done to improve the accuracy of your tree.
- Accuracy of the resulting decision tree.
- Summary of Results
- What was the accuracy of the most accurate decision tree constructed by your system?
- Include the most accurate tree you obtained in your report.
- Discuss the strengths and weaknesses of your system.
- Code: Include in your report:
- A printout of the code used.
- A short user manual explaining how to install, run, and use your system.
- Oral Report.
We will discuss the results from the individual projects during the class
on September 21st.
Be ready to show your results (prepare transparencies on your results)
and to discuss your project solution in class.
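For the "Control" experiment required in the written report, the data selection and the accuracy measurement can be sketched as follows in Python. This is a minimal illustration under stated assumptions, not a required implementation: records are dicts keyed by attribute name, "?" marks a missing value, "missing" is interpreted as missing in any attribute, and the eight attributes listed in DISCRETE are taken to be the discrete-valued ones. The classify argument is a hypothetical stand-in for whatever decision tree your code produces.

```python
from itertools import islice

MISSING = "?"  # assumption: marker used for missing values
# Assumption: these are the discrete-valued attributes of the dataset.
DISCRETE = ["workclass", "education", "marital-status", "occupation",
            "relationship", "race", "sex", "native-country"]


def control_training_set(records, limit=1000):
    """First `limit` records with no missing values, reduced to the
    discrete-valued attributes plus the class label."""
    complete = (r for r in records if MISSING not in r.values())
    return [{k: r[k] for k in DISCRETE + ["class"]}
            for r in islice(complete, limit)]


def accuracy(classify, test_records, limit=500):
    """Fraction of the first `limit` test records that `classify` labels correctly."""
    subset = test_records[:limit]
    if not subset:
        raise ValueError("no test records to evaluate")
    correct = sum(1 for r in subset if classify(r) == r["class"])
    return correct / len(subset)
```

A trivial baseline such as accuracy(lambda r: "<=50K", test_records), which always predicts the same class, gives a useful reference point when you report the accuracy of your tree.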