CS 539 Fall 2000 - Assignment 2

Computer Science Department

CS539 Machine Learning
Assignment 2 - Fall 2000

PROF. CAROLINA RUIZ

Due: Thursday, September 21, 2000 at 6:00 pm.

Project Description
Project Assignment
Report Submission and Due Date

PROJECT DESCRIPTION

Construct the most accurate decision tree you can for predicting whether the income of a given person is >50K or <=50K, using the census-income dataset from the US Census Bureau, which is available at the University of California, Irvine (UCI) Repository.

I have downloaded the dataset into the following directory: /cs/courses/cs539/f00/Projects/Census_Income_Data
You can access the dataset from there.

The census-income dataset contains census information for 48,842 people. It has 14 attributes for each person (age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country) and a boolean attribute, class, that classifies the income of the person as belonging to one of two categories: >50K or <=50K.
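As a starting point for working with the records described above, here is a minimal Python sketch of reading one record into its 14 attributes plus the class label. It rests on assumptions that the assignment does not state and that you should check against the files: one record per line with comma-separated fields, missing values written as "?", six numeric attributes, and class labels in the test file that may carry a trailing period. The names ATTRIBUTES, CONTINUOUS, MISSING, and parse_record are illustrative only.

```python
# Attribute names in the order listed in the project description;
# the class label is assumed to be the final field on each line.
ATTRIBUTES = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
]

# Assumption: these six attributes are numeric; the other eight are discrete.
CONTINUOUS = {
    "age", "fnlwgt", "education-num",
    "capital-gain", "capital-loss", "hours-per-week",
}

MISSING = "?"  # Assumption: missing values are written as a question mark.


def parse_record(line):
    """Split one comma-separated line into (attribute dict, class label)."""
    fields = [f.strip() for f in line.strip().split(",")]
    if len(fields) != len(ATTRIBUTES) + 1:
        raise ValueError(
            f"expected {len(ATTRIBUTES) + 1} fields, got {len(fields)}"
        )
    values = {}
    for name, raw in zip(ATTRIBUTES, fields):
        if raw == MISSING:
            values[name] = None          # mark the value as missing
        elif name in CONTINUOUS:
            values[name] = int(raw)      # numeric attribute
        else:
            values[name] = raw           # discrete attribute, kept as text
    label = fields[-1].rstrip(".")       # drop a trailing period, if any
    return values, label
```

Representing a missing value as None keeps it distinct from every legitimate attribute value, which makes it easy to filter or fill such records later.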

PROJECT ASSIGNMENT

The following are guidelines for the construction of your decision tree:

Code: You can use
  - the decision tree learning code from Chapter 3 of the textbook (CMU Lisp runs on the following CS machines: penguin, toucan, or crane; just type "lisp" to run it),
  - any other decision tree code you find available, or
  - your own code.
Your code must run on the CS or CCC Unix machines.

Training Instances: Use the census-income dataset. You may restrict your experiments to a subset of the dataset if your system cannot handle the whole dataset, but remember that the more accurate your system is, the better. Note that this dataset has missing values; it is up to you how to fill in appropriate data for them. It is also up to you how to discretize continuous attributes.

Test Instances: Test data are also available at the UCI repository. YOU MUST USE AT LEAST THE FIRST 500 TEST RECORDS FROM THAT TEST DATA IN YOUR EXPERIMENTS.
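
The guidelines leave the treatment of missing values and continuous attributes up to you. As one possible starting point, the Python sketch below fills a missing value with the most common observed value of that attribute and discretizes a numeric attribute into equal-width bins. Both choices are assumptions made for illustration, not requirements of the assignment; other strategies (for example, entropy-based cut points) may yield a more accurate tree. Records are assumed to be dictionaries in which a missing value is stored as None, and all function names are hypothetical.

```python
from collections import Counter


def fill_missing(records, attribute):
    """Replace None for `attribute` with its most common observed value."""
    observed = [r[attribute] for r in records if r[attribute] is not None]
    if not observed:
        return records  # nothing to infer a replacement from
    most_common = Counter(observed).most_common(1)[0][0]
    return [
        {**r, attribute: most_common} if r[attribute] is None else r
        for r in records
    ]


def equal_width_bins(values, n_bins):
    """Return the n_bins - 1 cut points that split the range evenly."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def discretize(value, cut_points):
    """Map a number to the index of the bin it falls into (0-based)."""
    return sum(value >= c for c in cut_points)
```

Compute the cut points from the training data only and reuse them unchanged on the test data, so that both sets are discretized in the same way.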

REPORT AND DUE DATE

Written Report. Please bring your report to my office (FL232) or to class by the due date/time. Your report should contain the following sections, each addressing the issues listed under it:
1. Code Description:
  - Did you write your own code or did you use an existing one?
    If you used code provided by someone else:
    - provide the reference
    - describe any adaptations of the code that you made
  - Describe briefly the algorithm that the code you used implements.
2. "Control" Experiment
  - Use your code to construct a decision tree using:
    - the first 1000 instances of the training dataset that do NOT contain missing values
    - the discrete-valued attributes only.
  - Include the resulting tree in your report.
  - What was the accuracy of that tree when tested over the first 500 instances of the test data?
3. Other Experiments: For each further experiment you ran describe:
  - Training Data: What data did you use to construct your decision tree?
  - Test Data: What data did you use to test your decision tree?
  - Any pre- or post-processing done to improve the accuracy of your tree.
  - Accuracy of the resulting decision tree.
4. Summary of Results
  - What was the accuracy of the most accurate decision tree constructed by your system?
  - Include the most accurate tree you obtained in your report.
  - Discuss the strengths and weaknesses of your system.
5. Code: Include in your report:
  - A printout of the code used.
  - A short user manual explaining how to install, run, and use your system.
Oral Report. We will discuss the results from the individual projects during the class on September 21st. Be ready to show your results (prepare transparencies on your results) and to discuss your project solution in class.
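
To make the "Control" experiment in section 2 of the written report concrete, here is a minimal Python sketch of selecting the first instances without missing values, restricting them to a chosen set of attributes, and measuring accuracy on a test set. It assumes that each example is a (values, label) pair, where values is a dictionary that stores a missing value as None, and that your tree is exposed as a function from such a dictionary to a class label. These names and representations are illustrative assumptions, not part of the assignment.

```python
from itertools import islice


def first_complete(examples, n):
    """Return the first n (values, label) pairs with no missing values."""
    complete = (
        (values, label)
        for values, label in examples
        if all(v is not None for v in values.values())
    )
    return list(islice(complete, n))


def keep_attributes(values, names):
    """Project a record onto the given attribute names."""
    return {name: values[name] for name in names}


def accuracy(predict, examples):
    """Fraction of examples for which predict(values) equals the label."""
    if not examples:
        raise ValueError("no test examples")
    correct = sum(predict(values) == label for values, label in examples)
    return correct / len(examples)
```

Under these assumptions, the control training set would be first_complete(training_examples, 1000) with each record projected onto the discrete-valued attributes, and the reported figure would be the accuracy over the first 500 test records, for example test_examples[:500].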
