### CS539 Machine Learning  Assignment Chapter 6 - Fall 2000

#### PROF. CAROLINA RUIZ

Due:
• First Part: Thursday, October 19, 2000 at 6:00 pm.
• Second Part: Thursday, October 26, 2000 at 6:00 pm.

#### PROJECT DESCRIPTION

Construct the most accurate naive Bayes classifier you can for predicting whether the income of a given person is >50K or <=50K, using the census-income dataset from the US Census Bureau, which is available at the University of California, Irvine (UCI) Repository.

I have downloaded the dataset into the following directory: /cs/courses/cs539/f00/Projects/Census_Income_Data
You can access the dataset from there.

The census-income dataset contains census information for 48,842 people. It has 14 attributes for each person (age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country) and a binary class attribute that classifies the income of the person as belonging to one of two categories: >50K or <=50K.
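To make the record layout concrete, here is a minimal sketch of turning one line of the data file into named fields. It is written in Python purely for illustration and is not part of the required C code. It assumes that each record is a single comma-separated line with the 14 attributes in the order listed above followed by the class label, and that missing values are marked with `?`; check both assumptions against the files in the course directory. The names `ATTRIBUTES`, `NUMERIC`, and `parse_record` are hypothetical, and the sample line is only an illustrative record in the assumed format.

```python
# Attribute order as listed in the dataset description, plus the class label.
ATTRIBUTES = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "class",
]

# Assumption: these six attributes hold integers; the rest are categorical.
NUMERIC = {"age", "fnlwgt", "education-num",
           "capital-gain", "capital-loss", "hours-per-week"}

MISSING = "?"  # assumption: marker used for missing values


def parse_record(line):
    """Split one comma-separated line into a dict keyed by attribute name."""
    fields = [field.strip() for field in line.strip().split(",")]
    if len(fields) != len(ATTRIBUTES):
        raise ValueError(f"expected {len(ATTRIBUTES)} fields, got {len(fields)}")
    record = dict(zip(ATTRIBUTES, fields))
    for name in NUMERIC:
        if record[name] != MISSING:
            record[name] = int(record[name])
    return record


# Illustrative line in the assumed format.
sample = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
          "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
record = parse_record(sample)
print(record["age"], record["workclass"], record["class"])
```

Keeping numeric and categorical attributes distinct at this stage makes it easier to decide later which ones to discretize.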

#### PROJECT ASSIGNMENT

This project consists of two parts:
##### Part 1: Due October 19 at 6:00 pm.
Study the C code for the naive Bayes classifier (Rainbow) provided with Chapter 6 of the textbook. Adapt the code to the census-income data as needed. Run preliminary experiments with this code over the dataset. Be ready to discuss the code and the results of your experiments with your classmates.
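As a reminder of what a naive Bayes classifier computes, the standard decision rule can be written as follows. The notation is an assumption made for illustration and is not taken from the Rainbow source: $V$ is the set of the two income classes, and $a_1, \dots, a_{14}$ are the attribute values of the person being classified.

$$
v_{NB} = \arg\max_{v_j \in V} \; P(v_j) \prod_{i=1}^{14} P(a_i \mid v_j)
$$

The probabilities are estimated from counts over the training data. One common choice is the m-estimate, where $n$ is the number of training examples with class $v_j$, $n_c$ is the number of those that also have attribute value $a_i$, $p$ is a prior estimate of the probability, and $m$ is the equivalent sample size:

$$
P(a_i \mid v_j) \approx \frac{n_c + m\,p}{n + m}
$$

With $m = 0$ this reduces to the relative frequency $n_c / n$; a positive $m$ keeps an attribute value that never occurs with a class in the training data from forcing the whole product to zero.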
##### Part 2: Due October 26 at 6:00 pm.
Construct, train, and test the most accurate naive Bayes classifier you can to predict the income class attribute (>50K or <=50K) of the census-income data. The following are guidelines to construct and train your naive Bayes classifier:
• Code: You must use the C code for naive Bayes classification from Chapter 6 of the textbook. Adapt this code as needed.

• Training Instances: Use the census-income dataset. You can restrict your experiments to a subset of the dataset if your system cannot handle the whole dataset, but remember that the more accurate your system is, the better. Note that this dataset has missing values; it is up to you how to fill in appropriate data for them. It is also up to you to decide whether it is a good idea to discretize continuous attributes, and if so, how.

• Test Instances: Test data are also available from the UCI repository. YOU MUST USE AT LEAST THE FIRST 1000 TEST RECORDS FROM THAT TEST DATA IN YOUR EXPERIMENTS.
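The two preprocessing decisions mentioned under Training Instances can be sketched as follows. This is only one possible approach, shown in Python for illustration rather than as part of the required C code: it fills a missing categorical value with the most frequent value of that attribute, and it discretizes a numeric attribute into equal-width bins. The `?` marker, the function names, the toy rows, and the age range of 17 to 90 are all assumptions made for the example; other choices (for example equal-frequency bins, or treating "missing" as its own value) may work better.

```python
from collections import Counter

MISSING = "?"  # assumption: marker used for missing values


def most_frequent(values):
    """Return the most common non-missing value in a column."""
    counts = Counter(v for v in values if v != MISSING)
    return counts.most_common(1)[0][0]


def fill_missing(rows, column):
    """Return new rows with missing entries in `column` replaced by its mode."""
    mode = most_frequent([row[column] for row in rows])
    return [
        {**row, column: mode} if row[column] == MISSING else row
        for row in rows
    ]


def equal_width_bin(value, low, high, n_bins):
    """Map a number to a bin index in 0..n_bins-1, clamping out-of-range values."""
    if value <= low:
        return 0
    if value >= high:
        return n_bins - 1
    width = (high - low) / n_bins
    return int((value - low) / width)


# Toy rows, invented for illustration only.
rows = [
    {"workclass": "Private", "age": 39},
    {"workclass": "?", "age": 50},
    {"workclass": "Private", "age": 23},
    {"workclass": "State-gov", "age": 61},
]

filled = fill_missing(rows, "workclass")
binned = [equal_width_bin(row["age"], 17, 90, 4) for row in filled]
print(filled[1]["workclass"], binned)
```

Whatever mode and bin boundaries you choose should be computed from the training data only and then applied unchanged to the test records, so that the test data do not influence the classifier.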

#### REPORT AND DUE DATE

• Written Report. Please bring your report to my office (FL232) or to class by the due date/time. Your report should contain the following sections that discuss the issues:

1. Code Description: Describe any adaptations of the code that you made.

2. Experiments: For each further experiment you ran, describe:
• Training Data: What data did you use to construct your naive Bayes classifier?
• Test Data: What data did you use to test your naive Bayes classifier?
• Any pre- or post-processing done to improve the accuracy of your classifier.
• Accuracy of the resulting naive Bayes classifier.

3. Summary of Results
• What was the accuracy of the most accurate naive Bayes classifier you obtained?
• Discuss how this accuracy compares with that of your most accurate decision tree and neural network from the previous assignments.
• Include a description of the most accurate naive Bayes classifier you obtained in your report.
• Discuss the strengths and the weaknesses of your system.

• Oral Report. We will discuss the results from the individual projects during the class on October 26. Be ready to show your results (prepare transparencies on your results) and to discuss your project solution in class.
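The accuracy figures requested in the report can be computed from the true and predicted class labels of the test records. The following is a minimal sketch in Python, given for illustration only; the function names and the toy label lists are hypothetical. It also counts each (actual, predicted) pair, which can help when discussing the strengths and weaknesses of a classifier.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of records whose predicted label equals the actual label."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


# Toy labels, invented for illustration only.
actual = [">50K", "<=50K", "<=50K", ">50K", "<=50K"]
predicted = [">50K", "<=50K", ">50K", "<=50K", "<=50K"]

acc = accuracy(actual, predicted)
counts = confusion_counts(actual, predicted)
print(f"accuracy = {acc:.2f}")
for (a, p), n in sorted(counts.items()):
    print(f"actual {a:>5}  predicted {p:>5}  count {n}")
```

Comparing the result with the accuracy of always predicting the more frequent class gives a useful baseline for the discussion.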