### CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2012  Homework and Project 3: Bayesian Models

#### Prof. Carolina Ruiz and Ken Loomis

DUE DATES: Friday, Nov. 16, 11:00 am (electronic submission) and 1:00 pm (hardcopy submission)

#### HOMEWORK AND PROJECT OBJECTIVES

The purpose of this project is as follows:

• To gain experience with the construction and evaluation of Bayesian models.

#### HOMEWORK AND PROJECT ASSIGNMENTS

This project consists of two parts:

1. Part I. INDIVIDUAL HOMEWORK ASSIGNMENT

Consider the following dataset.

```
@relation movie-preferences

@attribute genre {comedy, drama, action}
@attribute critics-reviews {thumbs-up, neutral, thumbs-down}
@attribute rating {R, PG-13}
@attribute IMAX {true, false}
@attribute likes {yes, no}

@data
( 1) comedy, thumbs-up,   R,     false, no
( 2) comedy, thumbs-up,   R,     true,  no
( 3) comedy, neutral,     R,     false, no
( 4) comedy, thumbs-down, PG-13, false, yes
( 5) comedy, neutral,     PG-13, true,  yes
( 6) drama,  thumbs-up,   R,     false, yes
( 7) drama,  thumbs-down, PG-13, true,  yes
( 8) drama,  neutral,     R,     true,  yes
( 9) drama,  thumbs-up,   PG-13, false, yes
(10) action, neutral,     R,     false, yes
(11) action, thumbs-down, PG-13, false, yes
(12) action, thumbs-down, PG-13, true,  no
(13) action, neutral,     PG-13, false, yes
(14) action, neutral,     R,     true,  no
```
where the likes attribute is the classification target.

1. (30 points) Construct the full naive Bayes model for this dataset. Show all the steps of the calculations. Draw the resulting graph and the Conditional Probability Table (CPT) associated with each node in the graph.

2. (5 points) Use your model above to classify the following data instance. Explain your work.
```
genre = action, critics-reviews = ?, rating = R, IMAX = ?
```

3. (10 points) Consider the following Bayesian net over the dataset above. Construct the conditional probability table for the critics-reviews node. Show your work.
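As a way to check hand calculations for the questions above, the following is a minimal Python sketch that estimates conditional probability tables by relative frequency and classifies an instance with naive Bayes. It rests on several assumptions that the assignment does not state: no Laplace smoothing is applied, attributes with missing values (`?`, represented here as `None`) are simply left out of the product, and the helper names `cpt` and `naive_bayes_posterior` are hypothetical. The parent set passed in the last line is only a placeholder and is not the structure of the Bayesian net in the assignment.

```python
from collections import Counter
from fractions import Fraction

ATTRS = ["genre", "critics-reviews", "rating", "IMAX", "likes"]
RAW = """
comedy thumbs-up   R     false no
comedy thumbs-up   R     true  no
comedy neutral     R     false no
comedy thumbs-down PG-13 false yes
comedy neutral     PG-13 true  yes
drama  thumbs-up   R     false yes
drama  thumbs-down PG-13 true  yes
drama  neutral     R     true  yes
drama  thumbs-up   PG-13 false yes
action neutral     R     false yes
action thumbs-down PG-13 false yes
action thumbs-down PG-13 true  no
action neutral     PG-13 false yes
action neutral     R     true  no
"""
ROWS = [dict(zip(ATTRS, line.split())) for line in RAW.strip().splitlines()]


def cpt(rows, child, parents=()):
    """Estimate P(child | parents) by relative frequency.

    Keys are tuples (parent values..., child value); values are exact fractions.
    """
    joint = Counter(tuple(r[p] for p in parents) + (r[child],) for r in rows)
    marginal = Counter(tuple(r[p] for p in parents) for r in rows)
    return {key: Fraction(n, marginal[key[:-1]]) for key, n in joint.items()}


def naive_bayes_posterior(rows, instance, target="likes"):
    """Return normalized class probabilities, skipping attributes set to None."""
    prior = cpt(rows, target)
    tables = {a: cpt(rows, a, (target,))
              for a, v in instance.items() if v is not None}
    scores = {}
    for (cls,), p in prior.items():
        for a, table in tables.items():
            # A value never seen with this class gets probability 0 (no smoothing).
            p *= table.get((cls, instance[a]), Fraction(0))
        scores[cls] = p
    total = sum(scores.values())
    return {cls: p / total for cls, p in scores.items()}


# Naive Bayes: the target has no parents, every other attribute has the target as parent.
prior = cpt(ROWS, "likes")
genre_given_likes = cpt(ROWS, "genre", ("likes",))

instance = {"genre": "action", "critics-reviews": None, "rating": "R", "IMAX": None}
posterior = naive_bayes_posterior(ROWS, instance)
print(posterior)

# Placeholder parent set, for illustration only.
print(cpt(ROWS, "critics-reviews", ("likes",)))
```

Exact fractions are used instead of floats so that each table entry can be compared directly with counts made by hand.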

2. Part II. GROUP PROJECT ASSIGNMENT

• Project Instructions: THOROUGHLY READ AND FOLLOW THE PROJECT GUIDELINES. These guidelines contain detailed information about how to structure your project, and how to prepare your written and oral reports.
*** NEW FOR THIS PROJECT: The written report for your group project should be at most 10 pages long (including all graphs, tables, figures, appendices, ...) and the font size should be no smaller than 11 pts. ***

• Data Mining Technique(s): We will run experiments using the following techniques:
• Pre-processing Techniques:
• Feature selection, feature creation, dimensionality reduction, noise reduction, attribute discretization, ...

• Data Mining Techniques:
• Naive Bayes
• Bayesian Nets
Given that these techniques are able to handle numeric attributes and missing values directly, make sure to run some experiments with no pre-processing and some experiments with pre-processing, and compare your results. Experiment also with different parameter values to see how they affect the graphs and CPTs of the models, and their classification performance.

• You can consider using advanced techniques to improve the accuracy of your predictions. For instance, you can try ensemble methods (see Section 5.6 of your textbook), ways to deal with imbalanced classification targets (see Section 5.7 of your textbook), etc.
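Attribute discretization, listed among the pre-processing techniques above, can take several forms. The following is a minimal Python sketch of one of them, equal-width binning. The function name `equal_width_bins`, the default of 10 bins, and the choice to pass missing values (`None`) through unchanged are all assumptions made for illustration; this is not Weka's implementation.

```python
def equal_width_bins(values, n_bins=10):
    """Map numeric values to bin indices 0..n_bins-1 using equal-width intervals.

    Missing values (None) are passed through unchanged.
    """
    known = [v for v in values if v is not None]
    lo, hi = min(known), max(known)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        if v is None:
            labels.append(None)
        elif width == 0:
            # All known values are identical, so everything falls in one bin.
            labels.append(0)
        else:
            # The maximum value is clamped into the last bin.
            labels.append(min(int((v - lo) / width), n_bins - 1))
    return labels


binned = equal_width_bins([0, 5, 10, None], n_bins=2)
print(binned)
```

Running the same classifier on both the raw and the binned version of an attribute is one way to set up the with and without pre-processing comparison requested above.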

3. Dataset: We will work with the same dataset used in Project 1. The following 2 files contain the dataset:
Important: For all experiments, perform missing value replacement for the target attribute. Replace the missing values with a new nominal value called "Missing". Or use the dataset that you may have saved for Project 1 as suggested at the beginning of the moderate challenge.
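The replacement of missing target values described above can be sketched in Python as follows. The sample rows and their values are hypothetical, the marker `?` for a missing value is an assumption based on the ARFF convention, and the function name `replace_missing_target` is made up for this illustration. Only the attribute names `AYP2012` and `LocationType` come from this assignment.

```python
def replace_missing_target(rows, target, missing_token="?", new_value="Missing"):
    """Return copies of the rows with missing target values set to new_value."""
    cleaned = []
    for row in rows:
        copy = dict(row)
        if copy.get(target) in (None, "", missing_token):
            copy[target] = new_value
        cleaned.append(copy)
    return cleaned


# Hypothetical rows, for illustration only.
sample = [
    {"LocationType": "A", "AYP2012": "Yes"},
    {"LocationType": "B", "AYP2012": "?"},
    {"LocationType": "?", "AYP2012": "No"},
]
cleaned = replace_missing_target(sample, target="AYP2012")
print(cleaned)
```

Only the target column is changed. Missing values in other attributes are left as they are, since the classifiers used in this project can handle them directly.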

4. Challenges: In each of the following challenges provide a detailed description of the preprocessing techniques used, the motivation for using these techniques, and any hypothesis/intuition gained about the information represented in the dataset. Answer the question provided as well as provide the information described in the PROJECT GUIDELINES.

• Easy Level: This is to be a simple guided experimentation, thus little description is needed for preprocessing techniques.

Create a Bayesian network model using Weka's implementation of NaiveBayes using the default parameters. Use AYP2012 as the target attribute.

Create a Bayesian network model using Weka's implementation of BayesNet using the default parameters. Use AYP2012 as the target attribute.

1. Compare the topology of the two models generated. Are they different? If so how? If not, why not?
2. Examine the conditional probability tables for the attribute "LocationType" for each of the models. How are these tables different? How are they similar?
3. Compare the accuracy of the two models. Are the accuracies of the models significantly different? Assume a significance of 2% variance for this comparison. Is this a reasonable assumption?

• Moderate Level: This is a bit more of a challenge (be sure to leave yourself time for challenges 3 and 4).

Use modified parameters and preprocessing techniques to generate a NaiveBayes model that classifies AYP2012. Provide detailed descriptions about the parameters used to develop your model and/or preprocessing techniques used. One should be able to repeat the experiment from your description.

Examine the model. Compare and contrast this model against a ZeroR model, a OneR model, models generated in Project 2, and models generated in the challenge above. Answer the following questions in your description about this experiment:

1. Is this model a better model than the other models? If so, why? If not, why not?
2. Did you find that using a dataset with no preprocessing but modified parameters produced a better model than using a preprocessed dataset with default parameters? Explain why or why not.
3. What challenge(s) did you encounter while developing this model? Give a more detailed explanation of how you used preprocessing, postprocessing, or some other technique to overcome a specific challenge.
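To make the ZeroR and OneR baselines used in this comparison concrete, the following is a simplified Python sketch of both, run on the small movie dataset from Part I. It handles nominal attributes only and reports errors on the training data. It is not Weka's implementation, and the function names `zero_r` and `one_r` are hypothetical.

```python
from collections import Counter, defaultdict

ATTRS = ["genre", "critics-reviews", "rating", "IMAX", "likes"]
RAW = """
comedy thumbs-up   R     false no
comedy thumbs-up   R     true  no
comedy neutral     R     false no
comedy thumbs-down PG-13 false yes
comedy neutral     PG-13 true  yes
drama  thumbs-up   R     false yes
drama  thumbs-down PG-13 true  yes
drama  neutral     R     true  yes
drama  thumbs-up   PG-13 false yes
action neutral     R     false yes
action thumbs-down PG-13 false yes
action thumbs-down PG-13 true  no
action neutral     PG-13 false yes
action neutral     R     true  no
"""
ROWS = [dict(zip(ATTRS, line.split())) for line in RAW.strip().splitlines()]


def zero_r(rows, target):
    """ZeroR: always predict the most frequent class."""
    return Counter(r[target] for r in rows).most_common(1)[0][0]


def one_r(rows, target):
    """OneR: pick the single attribute whose value-to-class rule makes the fewest training errors."""
    best = None
    for attr in rows[0]:
        if attr == target:
            continue
        by_value = defaultdict(Counter)
        for r in rows:
            by_value[r[attr]][r[target]] += 1
        # For each attribute value, predict the majority class seen with that value.
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        # Ties are broken in favor of the attribute encountered first.
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


majority = zero_r(ROWS, "likes")
attr, rule, errors = one_r(ROWS, "likes")
print(majority, attr, rule, errors)
```

A Bayesian model that does not clearly outperform these two baselines under the same evaluation protocol is hard to justify as "better", which is why they serve as reference points in the questions above.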

• WPI Level: This and the WPI+ level are the big challenges that you should spend the most time on.

Use preprocessing and modified parameters to generate a BayesNet that attempts to accurately classify AYP2012. Provide detailed descriptions about the parameters used to develop your model and/or preprocessing techniques used. One should be able to repeat the experiment from your description. Describe the topology of the model produced.

Examine the model. Compare and contrast this model against a ZeroR model, a OneR model, NaiveBayes, and models generated in Project 2.

1. Describe how this model is different from the BayesNet obtained in the "easy" challenge.
2. What challenge(s) did you encounter while developing this model? Give a more detailed explanation of how you used preprocessing, postprocessing, or some other technique to overcome a specific challenge.
3. Describe any anomalies that appeared in the topology. What might these anomalies mean about the data?

• WPI+ Level: This and the WPI level are the big challenges that you should spend the most time on.

Design another experiment with a goal different from the ones that have appeared previously in this assignment. Provide detailed descriptions about the parameters used to develop your model and/or preprocessing techniques used. One should be able to repeat the experiment from your description. Describe the topology of the model produced. Compare the performance of your model against ZeroR, OneR, and J4.8 models.

1. What was your motivation for choosing this goal? Is it very useful?
2. Are there any limitations of the dataset that make this a more challenging experiment? Explain.
3. Describe any anomalies that appeared in the topology. What might these anomalies mean about the data?

5. Grading sheet for this project.