### CS539 Machine Learning - Spring 2003  Project 5 - Bayesian Learning

#### PROF. CAROLINA RUIZ

Due Date: Monday, March 10, 2003 at 8 am.

#### PROJECT DESCRIPTION

Use the NaiveBayes and NaiveBayesSimple classifiers in the Weka system to construct Naive Bayes classifiers for each of the following problems:

1. Predicting the class attribute (CARAVAN: Number of mobile home policies) in The Insurance Company Benchmark (COIL 2000) dataset.

2. Predicting whether the income of a given person is >50K or <=50K using the census-income dataset from the US Census Bureau, which is available at the University of California, Irvine Repository.
The census-income dataset contains census information for 48,842 people. It has 14 attributes for each person (age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country) and a binary class attribute classifying the income of the person as belonging to one of two categories: >50K or <=50K.
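
As a rough illustration of the record layout described above, the following Python sketch reads one census-income record into its 14 attributes plus the class label. It rests on several assumptions that are not stated in this handout: that records are comma-separated, that a `?` field marks a missing value, and that six of the attributes are integers. The names `ATTRIBUTES`, `NUMERIC`, and `parse_record` are hypothetical, and the sample record is made up for illustration.

```python
# Attribute names as listed in the project description (class label comes last).
ATTRIBUTES = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
]

# Assumption: these attributes hold integers; all others are nominal.
NUMERIC = {"age", "fnlwgt", "education-num",
           "capital-gain", "capital-loss", "hours-per-week"}


def parse_record(line):
    """Turn one comma-separated record into (attribute dict, class label).

    A '?' field (assumed missing-value marker) is stored as None.
    """
    fields = [f.strip() for f in line.strip().split(",")]
    if len(fields) != len(ATTRIBUTES) + 1:
        raise ValueError("expected %d fields, got %d"
                         % (len(ATTRIBUTES) + 1, len(fields)))
    *values, label = fields
    record = {}
    for name, raw in zip(ATTRIBUTES, values):
        if raw == "?":
            record[name] = None          # missing value
        elif name in NUMERIC:
            record[name] = int(raw)      # numeric attribute
        else:
            record[name] = raw           # nominal attribute
    return record, label


# Made-up record in the assumed format, with a missing workclass.
sample = ("39, ?, 77516, Bachelors, 13, Never-married, Adm-clerical, "
          "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
sample_record, sample_label = parse_record(sample)
```

Counting the `None` entries per attribute over the whole file is one simple way to obtain the missing-value statistics requested in the Data section of the report.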

#### PROJECT ASSIGNMENT

1. Read Chapter 6 of the textbook about Bayesian Learning in great detail.

2. Read the NaiveBayes and the NaiveBayesSimple code in the Weka system in great detail.

3. The following are guidelines for the construction of your Naive Bayes Classifiers:

#### REPORT AND DUE DATE

• Written Report.

Your report should contain the following sections with the corresponding discussions:

1. Code Description: Describe the NaiveBayes and the NaiveBayesSimple code that you used from Weka. Explain the algorithm underlying the code in terms of the input it receives and the output it produces, and the main steps it follows to produce this output.
ALSO, EXPLAIN THE DIFFERENCE BETWEEN THE TWO APPROACHES: NaiveBayes and NaiveBayesSimple.
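
When describing the algorithm in terms of input, output, and main steps, it may help to keep a bare-bones version in mind. The sketch below is a generic Naive Bayes classifier for nominal attributes only: it picks the class c that maximizes P(c) times the product of P(a_i | c) over the attributes, with probabilities estimated from counts. It is not the Weka code and does not reproduce either NaiveBayes or NaiveBayesSimple. The add-one (Laplace) smoothing, the function names `train_naive_bayes` and `predict`, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Collect the counts needed for P(class) and P(attribute=value | class)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    domain = defaultdict(set)             # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            domain[i].add(value)
    return class_counts, value_counts, domain


def predict(model, row):
    """Return the class with the highest (log) posterior score for row."""
    class_counts, value_counts, domain = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)                   # log prior
        for i, value in enumerate(row):
            # Add-one smoothing so unseen values do not zero out the product.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(domain[i])
            score += math.log(numerator / denominator)    # log likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy data: two nominal attributes, two classes.
toy_rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
toy_labels = ["no", "no", "yes", "yes"]
toy_model = train_naive_bayes(toy_rows, toy_labels)
```

Numeric attributes such as age or hours-per-week do not fit this counting scheme directly; how each Weka class handles them is part of what the Code Description section should explain.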

2. Data: Describe the dataset that you selected in terms of the attributes present in the data, the number of instances, missing values, and other relevant characteristics.

Provide a detailed description of the preprocessing of your data. Justify the preprocessing you apply and explain why the resulting data is appropriate for constructing Naive Bayes classifiers from it.

3. Experiments: For each experiment you ran describe:
• Data: What data did you use to construct and test your classifier?
• Any additional pre- or post-processing done to the data or the classifier's output in order to improve the accuracy of your classifier.
• Accuracy of the resulting classifier.
• Discuss how this accuracy compares with that of your most accurate ZeroR experiment, decision trees, and neural nets from the previous assignments.
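
For the comparison against ZeroR, recall that ZeroR simply predicts the majority class of the training data, so its accuracy is the floor any useful classifier should beat. A minimal sketch of that baseline and of the accuracy computation follows. The function names `zero_r` and `accuracy` are hypothetical and the labels are made up, not results from either dataset.

```python
from collections import Counter


def zero_r(train_labels):
    """Return the majority class of the training labels (the ZeroR prediction)."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of test instances whose predicted label matches the actual one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Made-up labels for illustration only.
train_labels = ["<=50K", "<=50K", "<=50K", ">50K"]
test_labels = ["<=50K", ">50K", "<=50K", "<=50K"]

baseline_class = zero_r(train_labels)
baseline_accuracy = accuracy([baseline_class] * len(test_labels), test_labels)
```

Reporting the Naive Bayes accuracy next to this baseline, on the same test data, makes the comparison in the report direct.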

4. Summary of Results
• For each dataset, what was the accuracy of the most accurate classifier constructed in your project?
• Discuss the strengths and the weaknesses of your project.

• Oral Report. We will discuss the results from the individual projects during the class on March 10. Your oral report should summarize the different sections of your written report as described above. Each of you will have 5 minutes to explain your results and to discuss your project in class. Be prepared!

• Submission and Due Date.

Please submit the following files by email to ruiz@cs.wpi.edu by the deadline (Monday, March 10, 2003 at 8 am). Submissions received on Monday between 8:01 am and 10:00 am will be penalized with 30% off the grade, and submissions received after 10:00 am won't be accepted.

1. [your-lastname]_proj5_slides.[ext] containing your slides for your oral report. This file should be either a PDF file (ext=pdf) or a PowerPoint file (ext=ppt). Please use only lower case letters in the file name. For instance, my file would be named ruiz_proj5_slides.ppt

2. [your-lastname]_proj5_report.pdf containing your written report in PDF.
ALSO, PLEASE BRING A HARDCOPY OF YOUR REPORT TO CLASS ON MARCH 10, 2003.