Part I. GROUP HOMEWORK ASSIGNMENT
Consider the following cars dataset.
This dataset was constructed by taking a small sample of data instances and
attributes from the
Car Evaluation Dataset
available at
the UCI Machine Learning Repository.
@relation 'cars-weka.filters.unsupervised.attribute.Remove-R3,5-weka.filters.unsupervised.instance.Resample-S1-Z2.0'
@attribute buying {vhigh,high,med,low}
@attribute maint {vhigh,high,med,low}
@attribute persons {2,4,more}
@attribute safety {low,med,high}
@attribute class {unacc,acc,good}
@data
med,vhigh,more,low,unacc
med,vhigh,2,med,unacc
vhigh,vhigh,more,med,unacc
med,high,4,low,unacc
high,med,4,high,good
low,med,2,med,unacc
low,high,2,high,unacc
low,vhigh,more,med,acc
med,vhigh,4,med,acc
med,vhigh,4,med,acc
vhigh,vhigh,4,med,unacc
med,med,more,med,acc
med,vhigh,2,med,unacc
med,med,4,low,unacc
med,vhigh,more,low,unacc
med,low,4,med,acc
high,low,2,high,unacc
high,med,4,low,unacc
med,low,4,low,unacc
high,high,4,low,unacc
low,med,4,high,good
low,low,2,high,unacc
- Classification Rules using RIPPER:
You can choose between two possibilities:
Option 1: Solve part (a) below by hand. In this case you have to answer part (b) below too.
Option 2: Write your own code to solve part (a) below. In this case you don't have to answer part (b) below. Submit your code by email (with instructions on how to run it) and it will be read for correctness and tested. Your code will be graded over 25 points (separately from the score you receive in part (a)).
-
Use Weka, Excel, your own code (in a programming language of your choice), or other application
to help you calculate the metrics used by RIPPER.
- (20 points)
Construct the first rule that RIPPER would build over this dataset.
Show your work and explain each step of the process.
- (10 points)
Describe in words the process that RIPPER would follow to prune this rule.
(No need for you to prune the rule, you just need to describe what RIPPER would
do.)
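To double-check the hand calculation in part (a), the grow step can be reproduced in plain Python. This is a sketch of RIPPER's greedy condition selection, not Weka's JRip code: RIPPER builds rules for the rarest class first (here "good") and at each step adds the condition with the highest FOIL information gain. The data literals are copied from the listing above.

```python
from math import log2

# Cars dataset from the listing above: (buying, maint, persons, safety, class)
DATA = [
    ("med","vhigh","more","low","unacc"), ("med","vhigh","2","med","unacc"),
    ("vhigh","vhigh","more","med","unacc"), ("med","high","4","low","unacc"),
    ("high","med","4","high","good"), ("low","med","2","med","unacc"),
    ("low","high","2","high","unacc"), ("low","vhigh","more","med","acc"),
    ("med","vhigh","4","med","acc"), ("med","vhigh","4","med","acc"),
    ("vhigh","vhigh","4","med","unacc"), ("med","med","more","med","acc"),
    ("med","vhigh","2","med","unacc"), ("med","med","4","low","unacc"),
    ("med","vhigh","more","low","unacc"), ("med","low","4","med","acc"),
    ("high","low","2","high","unacc"), ("high","med","4","low","unacc"),
    ("med","low","4","low","unacc"), ("high","high","4","low","unacc"),
    ("low","med","4","high","good"), ("low","low","2","high","unacc"),
]
ATTRS = ["buying", "maint", "persons", "safety"]

def foil_gain(rule, cond, target, rows):
    """FOIL information gain of extending `rule` (a list of (attr_index, value)
    conditions) with one extra condition `cond`, for positive class `target`."""
    def covered(conds):
        return [r for r in rows if all(r[i] == v for i, v in conds)]
    before, after = covered(rule), covered(rule + [cond])
    p0 = sum(r[4] == target for r in before); n0 = len(before) - p0
    p1 = sum(r[4] == target for r in after);  n1 = len(after) - p1
    if p1 == 0:                       # condition covers no positives: useless
        return float("-inf")
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# Score every single-condition candidate for the rarest class ("good")
scores = {(ATTRS[i], v): foil_gain([], (i, v), "good", DATA)
          for i in range(4) for v in {r[i] for r in DATA}}
best = max(scores, key=scores.get)
print(best)  # → ('safety', 'high')
```

On this sample, safety=high comes out on top; re-running the scoring with rule=[(3, "high")] shows that adding maint=med (or persons=4) next makes the rule cover only "good" instances.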
-
Existing real-world application of Classification Rules:
Each group should search for an existing real-world successful application of Classification Rules. This success story should be about mining these rule-based models from data to discover novel and useful patterns that have made a difference in a certain industry or field. The application domain is up to the group (e.g., finance, sports, healthcare, science, ...). The application must be current (ideally not older than 5 years).
Your homework report for this part should contain (at most 1 page):
- (10 points) Name and description of the chosen application.
- (10 points) Detailed description of how this application uses Classification Rules.
- (5 points) List of (recent) references and sources that you used to identify and investigate this application.
- Association Rules using Apriori:
You can choose between two possibilities:
Option 1: Solve part (i) below by hand. In this case you have to answer part (ii) below too.
Option 2: Write your own code to solve part (i) below. In this case you don't have to answer part (ii) below. Submit your code by email (with instructions on how to run it) and it will be read for correctness and tested. Your code will be graded over 25 points (separately from the score you receive in part (i)).
-
Use Weka, Excel, your own code (in a programming language of your choice),
or other application
to help you calculate support of the itemsets constructed below.
If you write your own correct, well-documented code to solve this part
of the homework, you will ...
- Generate all frequent itemsets over this dataset, following
the Apriori algorithm level by level. Show each step of the process.
Let min-support-count = 3 instances (that is, min-support about 13%).
- (5 points)
State what the "join" condition is (called "merge" in the Fk-1xFk-1
method in your textbook p. 341).
- (5 points)
State what the "subset" condition is (called "candidate pruning" in
the Fk-1xFk-1 method in your textbook p. 341).
- For each level,
- (10 points)
Show how the "join" condition was used to generate k-itemsets
(this level's itemsets) from frequent (k-1)-itemsets
(previous level's frequent itemsets).
- (10 points)
Show how the "subset" condition was used to eliminate candidate
itemsets from consideration before unnecessarily counting their support.
- (10 points)
Count support for all remaining itemsets in the level.
- (5 points) What's the termination condition for this process?
Explain.
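For checking your level-by-level work, here is a minimal plain-Python sketch of the loop described above (an illustration, not Weka's implementation). Candidates are generated with the Fk-1 x Fk-1 "join" (merge two frequent (k-1)-itemsets that agree on their first k-2 items), pruned with the "subset" condition, and only then counted. For the cars data, each transaction would be the set of attribute=value pairs of one instance.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise frequent-itemset mining with Fk-1 x Fk-1 candidate generation.
    `transactions` is a list of frozensets of items."""
    def support(itemset):
        return sum(itemset <= t for t in transactions)

    items = sorted({i for t in transactions for i in t})
    # Level 1: frequent 1-itemsets
    freq = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_count}]
    k = 2
    while freq[-1]:
        prev = sorted(sorted(s) for s in freq[-1])
        # "join": merge two frequent (k-1)-itemsets sharing their first k-2 items
        candidates = {frozenset(a) | frozenset(b)
                      for a, b in combinations(prev, 2)
                      if a[:-1] == b[:-1] and a[-1] < b[-1]}
        # "subset" pruning: drop candidates with an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq[-1]
                             for s in combinations(sorted(c), k - 1))}
        # count support only for the surviving candidates
        freq.append({c for c in candidates if support(c) >= min_count})
        k += 1
    return [level for level in freq if level]  # stops when a level comes up empty
```

The termination condition is visible in the loop: once no k-itemset reaches the minimum support count, no larger itemset can either, so the process stops.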
- (10 points)
What are "lift", "leverage", and "conviction"? Provide an explicit formula
for each one of them (look at the Weka code to find those formulas).
Pick one association rule from those that you generate in the next part below,
and use the values of these metrics for this association rule
to judge how interesting/useful this rule is.
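For orientation while you dig the exact formulas out of the Weka source, the standard definitions these three names usually refer to can be written as small functions over fractional supports; treat this as a sketch to verify against the Weka code, with the example supports (0.4, 0.5, 0.3) being hypothetical.

```python
def lift(sup_ab, sup_a, sup_b):
    """lift(A -> B) = P(A,B) / (P(A) * P(B)); 1.0 means A and B are independent."""
    return sup_ab / (sup_a * sup_b)

def leverage(sup_ab, sup_a, sup_b):
    """leverage(A -> B) = P(A,B) - P(A) * P(B); 0.0 means independence."""
    return sup_ab - sup_a * sup_b

def conviction(sup_ab, sup_a, sup_b):
    """conviction(A -> B) = P(A) * P(not B) / P(A, not B); grows without
    bound as the rule's confidence approaches 100%."""
    if sup_a == sup_ab:               # confidence 1: the rule is never violated
        return float("inf")
    return sup_a * (1 - sup_b) / (sup_a - sup_ab)

# Hypothetical rule with supp(A)=0.4, supp(B)=0.5, supp(A and B)=0.3
print(lift(0.3, 0.4, 0.5), leverage(0.3, 0.4, 0.5), conviction(0.3, 0.4, 0.5))
```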
- Take the first frequent itemset of the last level you constructed above.
- (5 points) Generate all rules from this frequent itemset that have
exactly 2 items in the right-hand-side of the rule.
- (10 points) For each rule, calculate the confidence and the lift
of the rule.
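A small helper can enumerate exactly the rules asked for here: every split of the frequent itemset that leaves 2 items on the right-hand side, with confidence and lift computed from a support table. The `support` map and the values used in the example are hypothetical; substitute the supports you counted above.

```python
from itertools import combinations

def rules_with_2_rhs(itemset, support):
    """All rules X -> Y from `itemset` with exactly 2 items in the RHS.
    `support` maps frozensets to their fractional support."""
    rules = []
    whole = frozenset(itemset)
    for rhs in combinations(sorted(itemset), 2):
        rhs = frozenset(rhs)
        lhs = whole - rhs
        conf = support[whole] / support[lhs]      # confidence = P(X,Y) / P(X)
        rule_lift = conf / support[rhs]           # lift = confidence / P(Y)
        rules.append((lhs, rhs, conf, rule_lift))
    return rules
```

For a 3-itemset this yields three rules, each with a single-item left-hand side.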
- (5 points)
Explain how the process of mining association rules in Weka's Apriori
is performed in terms of the following parameters: lowerBoundMinSupport,
upperBoundMinSupport, delta, metricType, minMetric, numRules.
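As a rough sketch of the behaviour being asked about (an assumption to verify against Weka's documentation, not its source): Weka's Apriori re-mines repeatedly, starting just below upperBoundMinSupport and lowering the support threshold by delta each pass, until numRules rules whose metricType value meets minMetric are found or the threshold falls below lowerBoundMinSupport. Here `mine_at` is a hypothetical stand-in for one full mining pass.

```python
def apriori_support_schedule(mine_at, lower=0.1, upper=1.0, delta=0.05, num_rules=10):
    """Hypothetical sketch of Weka Apriori's outer loop over support thresholds.
    `mine_at(s)` stands in for one Apriori run at minimum support `s`,
    returning only the rules whose chosen metric meets minMetric."""
    support = upper - delta
    while support >= lower:
        rules = mine_at(support)          # one full mining + rule-generation pass
        if len(rules) >= num_rules:       # enough good rules: stop lowering support
            return support, rules[:num_rules]
        support -= delta                  # otherwise relax the threshold by delta
    return lower, []                      # gave up at lowerBoundMinSupport
```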
-
Existing real-world application of Association Rules:
Each group should search for an existing real-world successful application of Association Rules. This success story should be about mining association rules from data to discover novel and useful patterns that have made a difference in a certain industry or field. The application domain is up to the group (e.g., finance, sports, healthcare, science, ...). The application must be current (ideally not older than 5 years).
Your homework report for this part should contain (at most 1 page):
- (10 points) Name and description of the chosen application.
- (10 points) Detailed description of how this application uses Association Rules.
- (5 points) List of (recent) references and sources that you used to identify and investigate this application.
Part II. GROUP PROJECT ASSIGNMENT
- Project Instructions:
THOROUGHLY READ AND FOLLOW THE
PROJECT GUIDELINES.
These guidelines contain detailed information about how to structure your
project, and how to prepare your written and oral reports.
Note that the project guidelines specify strict page limits for your written report.
- Data Mining Technique(s):
We will run experiments using the following techniques:
- Pre-processing Techniques:
- Feature selection, feature creation, dimensionality reduction,
noise reduction, attribute discretization, ...
- Data Mining Techniques:
- Classification Rules: JRip (implementing RIPPER)
Given that this technique is able to handle numeric attributes and
missing values directly, make sure to run
some experiments with no pre-processing
and
some experiments with pre-processing, and compare your results.
Experiment also with different parameter values to see how
they affect the rules produced
and their classification performance.
- Association Rules and Classification Association Rules (CARs): Apriori
Given that this technique does not handle numeric attributes
directly, make sure to run
experiments with different pre-processing, and compare your results.
Based on the results of an experiment, design and apply new preprocessing
to improve upon the rules obtained.
- Advanced Techniques:
- You can consider using advanced techniques to improve the accuracy
of your predictions. For instance, you can try
ensemble methods (see Section 5.6 of your textbook),
ways to deal with imbalanced classification targets
(see Section 5.7 of your textbook), etc.
- Dataset:
We will work with the same dataset used in projects 1, 2, and 3:
Census-Income (also known as "Adult") Dataset
available from the
University of California, Irvine (UCI) Machine Learning Repository.
In particular,
- Use the data in the file "adult.data", and the description of the data in the file "adult.names".
- Use the nominal attribute salary (with values >50K and <=50K in the data files) as the classification target.
- Remove the fnlwgt attribute from the dataset.
- Challenges:
In each of the following challenges provide a detailed description of the
preprocessing techniques used, the motivation for using these techniques,
and any hypothesis/intuition gained about the information represented
in the dataset. Answer the question provided as well as provide the
information described in the
PROJECT GUIDELINES.
- Easy Level:
This is a simple guided experiment, so little description
of the preprocessing techniques is needed.
Construct classification rules using Weka's JRip implementation using the default parameters. Use salary as the target attribute.
Use 10-fold cross-validation to perform an analysis of the classification accuracy.
Construct association rules using Weka's Apriori implementation using the default parameters.
Construct classification association rules using Weka's Apriori implementation using the default parameters, except for car=true. (Lower the min. confidence threshold as needed to obtain at least 10 rules.)
Examine your three sets of rules. Compare and contrast them. Answer the following questions in your description about this experiment:
- Compare the three models generated.
Are they different? If so how? If not, why not?
Are there any rules in common in any two of these models?
Are there any rules in common in all three of these models?
- Compare the accuracies of the classification rules against the goodness metrics of the association rules.
- Moderate Level:
This is a bit more of a challenge.
Use modified parameters and preprocessing techniques to generate a set of JRip classification rules that classifies salary. Provide detailed descriptions about the parameters used to develop your model and/or preprocessing techniques used. One should be able to repeat the experiment from your description.
In particular, make sure to experiment with and without rule pruning.
Examine the model. Compare and contrast this model against a ZeroR model, a OneR model, models generated in Projects 2 and 3, and models generated in the challenge above. Answer the following questions in your description about this experiment:
- Is this model a better model than the other models? If so, why? If not, why not?
- Did you find that using a dataset with no preprocessing but modified parameters produced a better model than using a preprocessed dataset with default parameters? Explain why or why not.
- What challenge(s) did you encounter while developing this model? Give a more detailed explanation of how you used preprocessing, postprocessing, or some other technique to overcome a specific challenge.
- WPI Level:
Design another experiment with a different goal other than the ones that have appeared previously in this assignment. Provide detailed descriptions about the parameters used to develop your model and/or preprocessing techniques used. One should be able to repeat the experiment from your description.
- What was your motivation for choosing this goal? Is it very useful?
- Are there any limitations of the dataset that make this a more challenging experiment? Explain.
- Describe any anomalies that appeared in your model. What might these anomalies mean about the data?