### CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2012  Homework and Project 4: Classification Rules and Association Rules

#### Prof. Carolina Ruiz and Ken Loomis

DUE DATES: Friday, Nov. 30, 11:00 am (electronic submission) and 1:00 pm (hardcopy submission)

#### HOMEWORK AND PROJECT OBJECTIVES

The purpose of this project is two-fold:

• To gain experience with the construction and evaluation of classification rules.
• To gain experience with the construction and evaluation of association rules.

#### HOMEWORK AND PROJECT ASSIGNMENTS

This project consists of two parts:

1. Part I. INDIVIDUAL HOMEWORK ASSIGNMENT

See solutions by Ken Loomis.

Consider the reduced_userprofile.arff dataset. This dataset was constructed from the userprofile.csv file of the Restaurant & consumer data available at the UCI Data Repository.

1. Classification Rules using RIPPER:
Use Weka, Excel, your own code, or other application to help you calculate the metrics used by RIPPER.
1. (20 points) Construct the first rule that RIPPER would generate over this dataset. Show your work and explain each step of the process.
2. (10 points) Describe in words the process that RIPPER would follow to prune this rule. (No need for you to prune the rule, you just need to describe what RIPPER would do.)
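To grow a rule, RIPPER adds one condition at a time, greedily choosing the condition that maximizes FOIL's information gain. A minimal sketch of that metric (the coverage counts below are made-up placeholders, not values from reduced_userprofile.arff):

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL information gain for extending a rule with one more condition.

    p0, n0: positive/negative instances covered before adding the condition.
    p1, n1: positive/negative instances covered after adding it.
    """
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Hypothetical counts: the rule covers 60+/40- before the new condition
# and 35+/5- after it.
print(round(foil_gain(60, 40, 35, 5), 3))
```

A condition scores well when it keeps many of the positives (p1 stays large) while sharply raising the rule's precision, which is exactly the trade-off you should make explicit when showing your work above.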

2. Association Rules using Apriori:
Use Weka, Excel, your own code, or other application to help you calculate support of the itemsets constructed below.
1. Generate all frequent itemsets over this dataset, following the Apriori algorithm level by level. Show each step of the process. Let min-support = 40% (that is, min-support-count = 55 instances).
1. (5 points) State what the "join" condition is (called "merge" in the Fk-1xFk-1 method in your textbook p. 341).
2. (5 points) State what the "subset" condition is (called "candidate pruning" in the Fk-1xFk-1 method in your textbook p. 341).
3. For each level,
1. (10 points) Show how the "join" condition was used to generate k-itemsets (this level's itemsets) from frequent (k-1)-itemsets (previous level's frequent itemsets).
2. (10 points) Show how the "subset" condition was used to eliminate candidate itemsets from consideration before unnecessarily counting their support.
3. (10 points) Count support for all remaining itemsets in the level.
4. (5 points) What's the termination condition for this process? Explain.
2. (10 points) What are "lift", "leverage", and "conviction"? Provide an explicit formula for each one of them (look at the Weka code to find those formulas). Pick one association rule from those that you generate in the next part below, and use the values of these metrics for this association rule to judge how interesting/useful this rule is.
3. Take the first frequent itemset of the last level you constructed above.
1. (5 points) Generate all rules from this frequent itemset that have exactly 2 items in the right-hand-side of the rule.
2. (10 points) For each rule, calculate the confidence and the lift of the rule.
4. (5 points) Explain how the process of mining association rules in Weka's Apriori is performed in terms of the following parameters: lowerBoundMinSupport, upperBoundMinSupport, delta, metricType, minMetric, numRules.
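The "join" and "subset" (candidate pruning) conditions of the Fk-1xFk-1 method in question 2.1 can be sketched as follows. This is a generic illustration with toy items, not tied to the assignment's dataset:

```python
from itertools import combinations

def generate_candidates(freq_prev):
    """F(k-1) x F(k-1) candidate generation for Apriori.

    freq_prev: the frequent (k-1)-itemsets, each a sorted tuple of items.
    """
    freq = set(freq_prev)
    candidates = set()
    for a in freq:
        for b in freq:
            # Join condition: identical first k-2 items, differing last items
            # (a[-1] < b[-1] avoids generating the same candidate twice).
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # Subset condition: every (k-1)-subset of the candidate must
                # itself be frequent, otherwise prune before counting support.
                if all(tuple(s) in freq for s in combinations(cand, len(cand) - 1)):
                    candidates.add(cand)
    return candidates

# Toy frequent 2-itemsets over items {A, B, C, D}:
f2 = {('A', 'B'), ('A', 'C'), ('B', 'C'), ('B', 'D')}
print(sorted(generate_candidates(f2)))
```

Here ('B', 'C') and ('B', 'D') join into ('B', 'C', 'D'), but it is pruned because its subset ('C', 'D') is not frequent; only ('A', 'B', 'C') survives to have its support counted.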
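For question 2.2, all four goodness metrics follow directly from support counts. The definitions below are the standard ones; verify them against the Weka source as the question asks. The counts used in the example are hypothetical, not taken from the dataset:

```python
def rule_metrics(n, count_a, count_b, count_ab):
    """Interest metrics for a rule A -> B, given support counts.

    n: total number of instances; count_a, count_b, count_ab:
    support counts of A, of B, and of A and B together.
    """
    supp_a, supp_b, supp_ab = count_a / n, count_b / n, count_ab / n
    confidence = supp_ab / supp_a
    lift = confidence / supp_b                    # > 1: positive correlation
    leverage = supp_ab - supp_a * supp_b          # > 0: positive correlation
    # Conviction is infinite for a rule that is never wrong (confidence = 1).
    conviction = float('inf') if confidence == 1 else (1 - supp_b) / (1 - confidence)
    return confidence, lift, leverage, conviction

# Hypothetical counts over 138 instances: supp(A) = 70, supp(B) = 80,
# supp(A and B) = 60.
conf, lift, lev, conv = rule_metrics(138, 70, 80, 60)
```

When judging a mined rule, report all three beyond confidence: lift and leverage measure how far the rule departs from independence of antecedent and consequent, while conviction measures how often the rule would be wrong if they were independent.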
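For question 2.3, enumerating the rules with exactly two items on the right-hand side amounts to choosing every 2-subset of the frequent itemset as the consequent and putting the remaining items in the antecedent. A sketch with a toy 3-itemset:

```python
from itertools import combinations

def rules_with_two_item_rhs(itemset):
    """All rules X -> Y from a frequent itemset with exactly 2 items in Y."""
    items = set(itemset)
    rules = []
    for rhs in combinations(sorted(items), 2):
        lhs = items - set(rhs)
        if lhs:  # the antecedent must be non-empty
            rules.append((tuple(sorted(lhs)), rhs))
    return rules

for lhs, rhs in rules_with_two_item_rhs(('A', 'B', 'C')):
    print(lhs, '->', rhs)
```

For each such rule X -> Y, confidence is supp(X and Y) / supp(X) and lift is that confidence divided by supp(Y), using the supports you already counted while building the levels.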
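As a starting point for question 2.4, a simplified sketch of how those parameters interact: Weka's Apriori begins with minimum support at upperBoundMinSupport and repeatedly lowers it by delta until it finds numRules rules scoring at least minMetric on the chosen metricType, or hits lowerBoundMinSupport. The mine_rules helper below is hypothetical and stands for one full mining pass at a given support threshold:

```python
def apriori_loop(mine_rules, lower_bound=0.1, upper_bound=1.0,
                 delta=0.05, num_rules=10):
    """Simplified sketch of the outer control loop of Weka's Apriori.

    mine_rules(min_support) is a hypothetical helper returning the rules at
    that support threshold whose metric value is at least minMetric.
    """
    min_support = upper_bound
    while True:
        rules = mine_rules(min_support)
        # Stop once enough rules are found or the next step would fall
        # below the lower bound on minimum support.
        if len(rules) >= num_rules or min_support - delta < lower_bound:
            return min_support, rules[:num_rules]
        min_support -= delta
```

This is why lowering minMetric (as the easy-level CARs experiment suggests) can produce more rules without touching the support bounds: the threshold loop is driven by how many rules clear the metric cutoff.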

2. Part II. GROUP PROJECT ASSIGNMENT

• Project Instructions: THOROUGHLY READ AND FOLLOW THE PROJECT GUIDELINES. These guidelines contain detailed information about how to structure your project, and how to prepare your written and oral reports.
*** The written report for your group project should be at most 10 pages long (including all graphs, tables, figures, appendices, ...) and the font size should be no smaller than 11 pts. ***

• Data Mining Technique(s): We will run experiments using the following techniques:
• Pre-processing Techniques:
• Feature selection, feature creation, dimensionality reduction, noise reduction, attribute discretization, ...

• Data Mining Techniques:
• Classification Rules: JRip (implementing RIPPER)
Given that this technique is able to handle numeric attributes and missing values directly, make sure to run some experiments with no pre-processing and some experiments with pre-processing, and compare your results. Experiment also with different parameter values to see how they affect the rules produced and their classification performance.
• Association Rules and Classification Association Rules (CARs): Apriori
Given that this technique does not handle numeric attributes directly, make sure to run experiments with different pre-processing, and compare your results. Based on the results of an experiment, design and apply new preprocessing to improve upon the rules obtained.

• You can consider using advanced techniques to improve the accuracy of your predictions. For instance, you can try ensemble methods (see Section 5.6 of your textbook), ways to deal with imbalanced classification targets (see Section 5.7 of your textbook), etc.

3. Dataset: We will work with the same dataset used in projects 1, 2, and 3. The following 2 files contain the dataset:
Important: For all experiments, perform missing value replacement for the target attribute. Replace the missing values with a new nominal value called "Missing". Or use the dataset that you may have saved for Project 2 as suggested at the beginning of the moderate challenge.

4. Challenges: In each of the following challenges provide a detailed description of the preprocessing techniques used, the motivation for using these techniques, and any hypothesis/intuition gained about the information represented in the dataset. Answer the question provided as well as provide the information described in the PROJECT GUIDELINES.

• Easy Level: This is meant to be a simple guided experiment, so little description of the preprocessing techniques is needed.

Construct classification rules using Weka's JRip implementation using the default parameters. Use AYP2012 as the target attribute. Use 10-fold cross-validation to perform an analysis of the classification accuracy.

Construct association rules using Weka's Apriori implementation using the default parameters.

Construct classification association rules using Weka's Apriori implementation using the default parameters, except for car=true. (Lower the min. confidence threshold as needed to obtain at least 10 rules.)

1. Compare the three models generated. Are they different? If so how? If not, why not? Are there any rules in common in any two of these models? Are there any rules in common in all three of these models?
2. Compare the accuracies of the classification rules against the goodness metrics of the association rules.

• Moderate Level: This is a bit more of a challenge (be sure to leave yourself time for challenges 3 and 4).

Use modified parameters and preprocessing techniques to generate a set of JRip classification rules that classifies AYP2012. Provide detailed descriptions about the parameters used to develop your model and/or preprocessing techniques used. One should be able to repeat the experiment from your description. In particular, make sure to experiment with and without rule pruning.

Examine the model. Compare and contrast this model against a ZeroR model, a OneR model, models generated in Projects 2 and 3, and models generated in the challenge above. Answer the following questions in your description about this experiment:

1. Is this model a better model than the other models? If so, why? If not, why not?
2. Did you find that using a dataset with no preprocessing but modified parameters produced a better model than using a preprocessed dataset with default parameters? Explain why or why not.
3. What challenge(s) did you encounter while developing this model? Give a more detailed explanation of how you used preprocessing, postprocessing, or some other technique to overcome a specific challenge.

• WPI Level: This and the WPI+ level are the big challenges that you should spend the most time on.

Use preprocessing and modified parameters to generate association rules and classification association rules (CARs) that have high values for goodness metrics. Provide detailed descriptions about the parameters used to develop your model and/or preprocessing techniques used. One should be able to repeat the experiment from your description.

1. Describe the association rules produced. Any new or surprising patterns?
2. Examine the CARs model. Compare and contrast this model against a ZeroR model, a OneR model, JRip model above, and models generated in Projects 2 and 3.
3. What challenge(s) did you encounter while developing these two models? Give a more detailed explanation of how you used preprocessing, postprocessing, or some other technique to overcome a specific challenge.

• WPI+ Level:

Design another experiment with a different goal other than the ones that have appeared previously in this assignment. Provide detailed descriptions about the parameters used to develop your model and/or preprocessing techniques used. One should be able to repeat the experiment from your description.

1. What was your motivation for choosing this goal? Is it very useful?
2. Are there any limitations of the dataset that make this a more challenging experiment? Explain.
3. Describe any anomalies that appeared in your model. What might these anomalies mean about the data?

5. Grading sheet for this project.