CS539 Machine Learning
Assignment Chapter 6 - Fall 2000
Due:
First Part: Thursday, October 19, 2000 at 6:00 pm.
Second Part: Thursday, October 26, 2000 at 6:00 pm.
PROJECT DESCRIPTION
Construct the most accurate naive Bayes classifier you can
for predicting whether the income of a given person is >50K or <=50K
using the census-income dataset from the US Census Bureau, which is
available at the Univ. of California Irvine Repository.
I have downloaded the dataset into the following directory:
/cs/courses/cs539/f00/Projects/Census_Income_Data
You can access the dataset from there.
The census-income dataset contains census information for 48,842
people. It has 14 attributes for each person
(age,
workclass,
fnlwgt,
education,
education-num,
marital-status,
occupation,
relationship,
race,
sex,
capital-gain,
capital-loss,
hours-per-week, and
native-country)
and a boolean class attribute that classifies the income
of the person into one of two categories: >50K or <=50K.
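As an illustration of how one record with these 15 fields might be read, here is a minimal C sketch. It assumes (this is not stated above) that each record is a single line of comma-separated values and that a missing value is written as "?". The names N_FIELDS, split_record, and is_missing are hypothetical and are not part of the textbook code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define N_FIELDS 15  /* 14 attributes plus the class label */

/* Split one comma-separated record, in place, into N_FIELDS trimmed strings.
 * The line buffer is modified and fields[] points into it.
 * Returns 1 on success, 0 if the line does not hold exactly N_FIELDS fields. */
int split_record(char *line, char *fields[N_FIELDS])
{
    int count = 0;
    char *p = line;
    char *comma, *stop;

    assert(line != NULL && fields != NULL);
    for (;;) {
        /* Skip whitespace before the field. */
        while (*p != '\0' && isspace((unsigned char)*p))
            p++;
        if (count == N_FIELDS)
            return 0;                     /* more fields than expected */
        fields[count++] = p;

        comma = strchr(p, ',');
        stop = comma ? comma : p + strlen(p);
        /* Trim whitespace (including a trailing newline) after the field. */
        while (stop > p && isspace((unsigned char)stop[-1]))
            stop--;
        *stop = '\0';

        if (comma == NULL)
            break;                        /* that was the last field */
        p = comma + 1;
    }
    return count == N_FIELDS;
}

/* Assumption: a missing value is encoded as a lone question mark. */
int is_missing(const char *field)
{
    return strcmp(field, "?") == 0;
}
```

With this layout, fields[0] would hold age and fields[14] the class label; check the actual file format before relying on either assumption.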
PROJECT ASSIGNMENT
This project consists of two parts:
Part 1: Due October 19 at 6:00 pm.
STUDY the
C code for the naive Bayes classifier (Rainbow)
provided with Chapter 6 of the textbook.
Adapt the code to the Census-income data as needed.
Run preliminary experiments with this code over the dataset.
Be ready to discuss with your classmates the code as well as
the results of your experiments.
Part 2: Due October 26 at 6:00 pm.
Construct, train, and test
the most accurate naive Bayes classifier you can to predict the income
class (>50K or <=50K) of the Census-Income data.
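To make the construct/train/test cycle concrete, here is a minimal sketch of the counting and decision rule behind a naive Bayes classifier over discrete attributes. It is NOT the textbook's Rainbow code and does not replace it; it only illustrates the idea. All names (NbModel, nb_init, nb_train, nb_classify) and the sizes NB_ATTRS and NB_VALUES are hypothetical. Probabilities are estimated with add-one smoothing, which is one possible choice, so that an attribute value never seen with a class does not force the whole product to zero.

```c
#include <assert.h>
#include <string.h>

#define NB_CLASSES 2   /* e.g. 0 for <=50K, 1 for >50K */
#define NB_ATTRS   3   /* hypothetical: number of attributes used */
#define NB_VALUES  4   /* hypothetical: max distinct codes per attribute */

typedef struct {
    int class_count[NB_CLASSES];                       /* examples per class */
    int value_count[NB_CLASSES][NB_ATTRS][NB_VALUES];  /* value counts per class */
    int total;                                         /* all training examples */
} NbModel;

void nb_init(NbModel *m)
{
    memset(m, 0, sizeof *m);
}

/* Training is just counting: one pass over the labeled examples. */
void nb_train(NbModel *m, const int x[NB_ATTRS], int label)
{
    int a;

    assert(label >= 0 && label < NB_CLASSES);
    m->class_count[label]++;
    m->total++;
    for (a = 0; a < NB_ATTRS; a++) {
        assert(x[a] >= 0 && x[a] < NB_VALUES);
        m->value_count[label][a][x[a]]++;
    }
}

/* Return the class c maximizing P(c) * prod_a P(x[a] | c),
 * with every estimate add-one smoothed. */
int nb_classify(const NbModel *m, const int x[NB_ATTRS])
{
    int c, a, best = 0;
    double best_score = -1.0;   /* every real score is positive */

    for (c = 0; c < NB_CLASSES; c++) {
        double score = (m->class_count[c] + 1.0) / (m->total + NB_CLASSES);
        for (a = 0; a < NB_ATTRS; a++)
            score *= (m->value_count[c][a][x[a]] + 1.0)
                   / (m->class_count[c] + NB_VALUES);
        if (score > best_score) {
            best_score = score;
            best = c;
        }
    }
    return best;
}
```

With only a handful of attributes the product of doubles stays well within range; with many more attributes, summing log-probabilities instead would avoid underflow.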
The following are guidelines to construct and train your naive Bayes classifier:
- Code: You must use the
C code for naive Bayes classification
from Chapter 6 of the textbook. Adapt this code as needed.
- Training Instances:
Use the
census-income dataset.
You can restrict your experiments to a subset of the dataset if
your system cannot handle the whole dataset. But remember that the
more accurate your system is, the better.
Also, note that this dataset has missing values. It is up to you how to
fill in appropriate data for those missing values. It is also up to you
to decide whether it is a good idea to discretize continuous attributes,
and if so, how.
- Test Instances:
Test data are also available at the UCI repository.
YOU MUST USE AT LEAST THE FIRST 1000 TEST RECORDS FROM THAT TEST
DATA IN YOUR EXPERIMENTS.
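The guidelines above leave the handling of missing values and continuous attributes open. As one possible starting point (not a recommendation), the sketch below shows equal-width discretization of a continuous attribute and replacement of missing values by the most frequent observed value. It assumes attribute values have already been encoded as integer codes in [0, n_values) with a separate code, such as -1, marking a missing value. All function names and MAX_VALUES are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VALUES 64  /* hypothetical upper bound on codes per attribute */

/* Map a continuous value to one of n_bins equal-width intervals over
 * [lo, hi]. Values outside the range are clamped to the end bins. */
int equal_width_bin(double value, double lo, double hi, int n_bins)
{
    int bin;

    assert(n_bins > 0 && hi > lo);
    if (value <= lo)
        return 0;
    if (value >= hi)
        return n_bins - 1;
    bin = (int)((value - lo) / (hi - lo) * n_bins);
    return bin < n_bins ? bin : n_bins - 1;
}

/* Most frequent non-missing code in codes[0..n-1]. Ties go to the
 * smallest code. Returns missing_code if every entry is missing. */
int most_frequent_code(const int *codes, size_t n, int n_values,
                       int missing_code)
{
    int counts[MAX_VALUES] = {0};
    int best = missing_code, best_count = 0, v;
    size_t i;

    assert(n_values > 0 && n_values <= MAX_VALUES);
    for (i = 0; i < n; i++) {
        if (codes[i] == missing_code)
            continue;
        assert(codes[i] >= 0 && codes[i] < n_values);
        counts[codes[i]]++;
    }
    for (v = 0; v < n_values; v++) {
        if (counts[v] > best_count) {
            best_count = counts[v];
            best = v;
        }
    }
    return best;
}

/* Replace every missing entry, in place, with the most frequent code. */
void fill_missing(int *codes, size_t n, int n_values, int missing_code)
{
    int mode = most_frequent_code(codes, n, n_values, missing_code);
    size_t i;

    for (i = 0; i < n; i++)
        if (codes[i] == missing_code)
            codes[i] = mode;
}
```

Other choices are equally defensible, for example equal-frequency bins, or skipping a missing attribute when counting and classifying instead of filling it in. Whichever you choose, compute the bin boundaries and the fill-in value from the training data only and reuse them on the test data.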
REPORT AND DUE DATE
- Written Report.
Please bring your report to my office (FL232) or to class by the due date/time.
Your report should contain the following sections that discuss the issues:
- Code Description:
Describe any adaptations of the code that you made.
- Experiments: For each further experiment you ran, describe:
- Training Data: What data did you use to construct your naive Bayes classifier?
- Test Data: What data did you use to test your naive Bayes classifier?
- Any pre- or post-processing done to improve the accuracy of your classifier.
- Accuracy of the resulting naive Bayes classifier.
- Summary of Results
- What was the accuracy of the most accurate naive Bayes classifier
you obtained?
- Discuss how this accuracy compares with that of your
most accurate decision tree and neural network from the previous assignments.
- Include a description of the most accurate naive Bayes classifier you obtained
in your report.
- Discuss the strengths and the weaknesses of your system.
- Oral Report.
We will discuss the results from the individual projects during the class
on October 26.
Be ready to show your results (prepare transparencies on your results)
and to discuss your project solution in class.
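For the accuracy figures requested in the written report, a minimal sketch of how accuracy on the test records could be computed is given below. It assumes predictions and true labels are stored as parallel arrays of 0/1 codes (1 standing for >50K); the Confusion type and both function names are hypothetical. Keeping the four outcome counts, and not only the final ratio, can help when discussing the strengths and weaknesses of your system.

```c
#include <assert.h>
#include <stddef.h>

/* Outcome counts for a two-class problem, treating class 1 as "positive". */
typedef struct {
    size_t tp;  /* predicted 1, actual 1 */
    size_t tn;  /* predicted 0, actual 0 */
    size_t fp;  /* predicted 1, actual 0 */
    size_t fn;  /* predicted 0, actual 1 */
} Confusion;

Confusion count_outcomes(const int *predicted, const int *actual, size_t n)
{
    Confusion c = {0, 0, 0, 0};
    size_t i;

    assert(predicted != NULL && actual != NULL);
    for (i = 0; i < n; i++) {
        if (predicted[i] == 1 && actual[i] == 1)
            c.tp++;
        else if (predicted[i] == 0 && actual[i] == 0)
            c.tn++;
        else if (predicted[i] == 1)
            c.fp++;
        else
            c.fn++;
    }
    return c;
}

/* Fraction of correctly classified records; 0.0 if there are none. */
double accuracy(const Confusion *c)
{
    size_t total = c->tp + c->tn + c->fp + c->fn;

    return total == 0 ? 0.0 : (double)(c->tp + c->tn) / (double)total;
}
```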