CS 4341 C00 - Project 2

Computer Science Department

CS4341 Introduction to Artificial Intelligence
Project 2 - C 2000

PROF. CAROLINA RUIZ

Due Date: Wednesday, March 1, 2000 at 6 pm.

Project Description
Project Assignment
Report Submission and Due Date

PROJECT DESCRIPTION

Construct the most accurate decision tree you can for predicting whether the income of a given person is >50K or <=50K, using the census-income dataset from the US Census Bureau. The dataset is available on the CCC machines in the directory /cs/cs4341/Project2. The files in that directory were taken from the University of California, Irvine Machine Learning Repository.

The directory contains the following files:

Index: List of files in the directory
census-income.names: Contains a description of the database, including information about the attributes (some are discrete and some are continuous), missing values, etc.
The census-income dataset contains census information for 48842 people. It has 14 attributes for each person (age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country) and a boolean attribute, class, that classifies the income of the person as belonging to one of the two categories >50K or <=50K.
Out of those 48842 instances (or records), there are 45222 instances without missing values (meaning that a value has been provided for each of the attributes listed above for all such instances).
The database has been split into two parts at random: 2/3 of the records for training (contained in the file census-income.data), and 1/3 of the records for testing (contained in the file census-income.test).
census-income.data: Contains the training records. There are 32561 training instances, 30162 without missing values.
census-income.test: Contains the testing records. There are 16281 test instances, 15060 without missing values.
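As a starting point for working with these files, the sketch below reads records into dictionaries and keeps only those without missing values. It assumes that each record is one comma-separated line with the 14 attributes followed by the class, and that a missing value is marked with "?"; neither detail is stated in this handout, so verify both against census-income.names. The names ATTRIBUTES, MISSING, read_records, and without_missing are hypothetical, and the two sample rows are invented for illustration only.

```python
import csv

# Column order assumed to follow the attribute list in this handout.
ATTRIBUTES = ["age", "workclass", "fnlwgt", "education", "education-num",
              "marital-status", "occupation", "relationship", "race", "sex",
              "capital-gain", "capital-loss", "hours-per-week",
              "native-country", "class"]

# Assumed missing-value marker; check census-income.names.
MISSING = "?"


def read_records(lines):
    """Parse comma-separated lines into dicts keyed by attribute name."""
    records = []
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) != len(ATTRIBUTES):
            continue  # skip blank or malformed lines
        values = [value.strip() for value in row]
        records.append(dict(zip(ATTRIBUTES, values)))
    return records


def without_missing(records):
    """Keep only records in which no attribute has the missing marker."""
    return [r for r in records if MISSING not in r.values()]


# Invented sample rows, not taken from the real files.
SAMPLE = [
    "39, Private, 100000, Bachelors, 13, Never-married, Sales, "
    "Not-in-family, White, Male, 0, 0, 40, United-States, <=50K",
    "52, ?, 200000, HS-grad, 9, Married, ?, "
    "Husband, White, Male, 0, 0, 45, United-States, >50K",
]

sample_records = read_records(SAMPLE)
complete_records = without_missing(sample_records)
```

To read a real file, pass an open file object instead of the sample list, for example read_records(open("census-income.data")). All values are kept as strings here; continuous attributes such as age would still need to be converted to numbers.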

PROJECT ASSIGNMENT

The following are guidelines for the construction and testing of your decision tree:

Code: You MUST write your own decision tree learning code. There are several pieces of code available online (including a Lisp version) that you may examine for guidance, but the code you submit MUST be your own. Your code must run on the CCC Unix machines.
Training Instances: For the construction of your decision tree, use instances from the census-income.data file. You can restrict your experiments to a subset of the records in that file if your system cannot handle the whole file. But remember that the more accurate your decision tree is, the better. Also, note that this dataset has missing values. You can use only records without missing values, but you can earn extra credit if you use records with missing values as long as you use a sound way to fill in appropriate data for those missing values and you explain it clearly in your written report.
You should pre-process the attributes so that you maximize the accuracy of your decision tree. Pre-processing alternatives include: disregarding an attribute that doesn't seem to have any predictive capability; and "discretizing" continuous values, that is dividing continuous attributes (e.g. age) into a few intervals (e.g. age 10-20, age 21-30, age 31-40, ...).
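The discretization step described above can be sketched as follows. This is only one possible approach: the cut points in AGE_CUTS are illustrative, not recommended values, and the function names are hypothetical. Each cut point is treated as the inclusive upper bound of its interval.

```python
import bisect

# Illustrative cut points only; choosing good ones is part of the project.
AGE_CUTS = [20, 30, 40]


def interval_index(value, cuts):
    """Return the index of the interval containing value.

    cuts must be sorted in ascending order. Index 0 covers values up to
    and including cuts[0]; index len(cuts) covers values above cuts[-1].
    """
    return bisect.bisect_left(cuts, value)


def interval_label(name, value, cuts):
    """Return a readable discrete label for a continuous value."""
    i = interval_index(value, cuts)
    if i == 0:
        return f"{name}<={cuts[0]}"
    if i == len(cuts):
        return f"{name}>{cuts[-1]}"
    return f"{name} in ({cuts[i - 1]}, {cuts[i]}]"
```

With these cut points, ages 21 through 30 all map to interval 1, so the tree can treat age as a discrete attribute with four values.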
Test Instances: Test data are available in the census-income.test file. To test your decision tree you must use at least 1000 records from that file (say, the first 1000 records without missing values). The accuracy of a decision tree is measured as the percentage of correctly classified instances. That is, accuracy = 100 * (number of correctly classified instances / total number of classified instances).
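The accuracy measure can be computed with a few lines of code. The sketch below assumes that the predicted and actual class labels are available as two lists of equal length; the function name accuracy and the example labels are hypothetical.

```python
def accuracy(predicted, actual):
    """Return the percentage of instances whose predicted class is correct."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on zero instances")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


# Invented labels for illustration: three of the four predictions match.
example_predicted = [">50K", "<=50K", "<=50K", ">50K"]
example_actual = [">50K", "<=50K", ">50K", ">50K"]
example_accuracy = accuracy(example_predicted, example_actual)  # 75.0
```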

REPORT AND DUE DATE

Project 2 is due on Wednesday, March 1 at 6:00 pm. Your system should follow the CS Department Documentation Standard.

Program and Decision Tree. You should submit (1) the source code of your program and (2) the most accurate decision tree you obtained, using the turnin program.
Written Report. Submit a report.txt file containing your written report. Your report should discuss the following issues:
1. a description of your own code,
2. the description of the (subset of the) dataset used by your program to construct and to test your decision tree,
3. the experiments you ran with the system,
4. the most accurate decision tree constructed by your system,
5. any pre- or post-processing done to improve the accuracy of your tree,
6. evaluation of your tree using the test data,
7. strengths and weaknesses of your system.
Your report should also include a short user manual explaining how to install, run, and use your system.
Oral Report. We will discuss the results from the different group projects in class on March 2, 2000. Each group should present their results (prepare ONE good overhead slide) and discuss their project solution in class for up to 2 minutes. The oral presentation will be worth 10% of the project grade.
