WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2010 
Homework and Project 4: Association Rules

PROF. CAROLINA RUIZ 

DUE DATES: Friday, Dec. 3, 9:00 am (electronic submission) and 11:00 am (hardcopy submission) 
------------------------------------------


HOMEWORK AND PROJECT OBJECTIVES

The purpose of this project is multi-fold.

HOMEWORK AND PROJECT ASSIGNMENTS

Readings: Read in great detail Sections 6.1, 6.2, 6.3, and 6.7 of your textbook.

This project consists of two parts:

  • Part I. INDIVIDUAL HOMEWORK ASSIGNMENT

    See Solutions to this homework assignment by Yutao Wang.

    Consider the zoo.arff dataset, converted to ARFF format from the Zoo Data Set available at the Univ. of California Irvine KDD Data Repository.

    1. Load this dataset into Weka. Remove the 1st attribute (animal_name), which is a string. Go to "Associate" and run Apriori with "numRules = 30", "outputItemSets = True", "verbose = True", and default values for the remaining parameters.
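
    For reference, the same run can be reproduced outside the GUI. Below is a minimal Java sketch using the Weka API (class and setter names as in Weka 3.6; the file path is an assumption); the step 2 variation is noted in a comment.

      import weka.associations.Apriori;
      import weka.core.Instances;
      import weka.core.converters.ConverterUtils.DataSource;
      import weka.filters.Filter;
      import weka.filters.unsupervised.attribute.Remove;

      public class ZooApriori {
          public static void main(String[] args) throws Exception {
              // Load the dataset (file path is an assumption).
              Instances data = DataSource.read("zoo.arff");

              // Remove the 1st attribute (animal_name), a string attribute.
              Remove remove = new Remove();
              remove.setAttributeIndices("1");
              remove.setInputFormat(data);
              data = Filter.useFilter(data, remove);

              // Configure Apriori as in this step; everything else stays
              // at its default value.
              Apriori apriori = new Apriori();
              apriori.setNumRules(30);
              apriori.setOutputItemSets(true);
              apriori.setVerbose(true);
              // For step 2 below, additionally:
              // apriori.setTreatZeroAsMissing(true);

              apriori.buildAssociations(data);
              System.out.println(apriori); // prints itemsets and rules
          }
      }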

    2. Now run Apriori with "numRules = 30", "outputItemSets = True", "verbose = True", "treatZeroAsMissing = True", and default values for the remaining parameters.

      1. [5 points] What difference do you see between the rules obtained in Parts 1 and 2 above? Explain.

      2. [5 points] From now on, consider just the second set of rules (that is, the ones obtained with "treatZeroAsMissing = True"). Choose an association rule you find interesting and explain it. Include the rule's confidence and support values in your explanation.

      3. [10 points] What are "lift", "leverage", and "conviction"? Provide an explicit formula for each one of them (look at the Weka code to find those formulas). Use the values of these metrics for the association rule you chose in the previous part to judge how interesting/useful this rule is.
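
      For orientation, the definitions commonly used in the literature are sketched below in LaTeX notation, for a rule A => B; the question still asks you to confirm the exact formulas (and counting conventions) in Weka's source.

        \mathrm{lift}(A \Rightarrow B) = \frac{P(A \wedge B)}{P(A)\,P(B)} = \frac{\mathrm{conf}(A \Rightarrow B)}{P(B)}

        \mathrm{leverage}(A \Rightarrow B) = P(A \wedge B) - P(A)\,P(B)

        \mathrm{conviction}(A \Rightarrow B) = \frac{P(A)\,P(\neg B)}{P(A \wedge \neg B)}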

      4. Look at the itemsets generated. Let's consider in particular the generation of 5-itemsets from 4-itemsets:
        Minimum support: 0.35 (35 instances)
        
        ...
        
        Size of set of large itemsets L(4): 8
        
        Large Itemsets L(4):
        hair=1 milk=1 toothed=1 backbone=1 38
        hair=1 milk=1 toothed=1 breathes=1 38
        hair=1 milk=1 backbone=1 breathes=1 39
        hair=1 toothed=1 backbone=1 breathes=1 38
        milk=1 toothed=1 backbone=1 breathes=1 40
        milk=1 backbone=1 breathes=1 tail=1 35
        toothed=1 backbone=1 breathes=1 legs=4 35
        toothed=1 backbone=1 breathes=1 tail=1 38
        
        Size of set of large itemsets L(5): 1
        
        Large Itemsets L(5):
        hair=1 milk=1 toothed=1 backbone=1 breathes=1 38
        

        1. [5 points] State what the "join" condition is (called "merge" in the F(k-1) x F(k-1) method in your textbook, p. 341). Show how the "join" condition was used to generate 5-itemsets from 4-itemsets. (Warning: not all candidate 5-itemsets are shown above.)

        2. [5 points] State what the "subset" condition is (called "candidate pruning" in the F(k-1) x F(k-1) method in your textbook, p. 341). Show how the "subset" condition was used to eliminate candidate 5-itemsets from consideration before unnecessarily counting their support.
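
        To make both conditions concrete, here is a minimal, self-contained Java sketch (not Weka's actual code) that joins the L(4) itemsets listed above and prunes any candidate with an infrequent 4-subset; items are kept in Weka's attribute order, as in the output.

          import java.util.*;

          public class AprioriGen {
              public static void main(String[] args) {
                  // L(4) copied from the Weka output above.
                  String[][] l4 = {
                      {"hair=1","milk=1","toothed=1","backbone=1"},
                      {"hair=1","milk=1","toothed=1","breathes=1"},
                      {"hair=1","milk=1","backbone=1","breathes=1"},
                      {"hair=1","toothed=1","backbone=1","breathes=1"},
                      {"milk=1","toothed=1","backbone=1","breathes=1"},
                      {"milk=1","backbone=1","breathes=1","tail=1"},
                      {"toothed=1","backbone=1","breathes=1","legs=4"},
                      {"toothed=1","backbone=1","breathes=1","tail=1"}};
                  List<List<String>> lk = new ArrayList<>();
                  for (String[] s : l4) lk.add(Arrays.asList(s));
                  Set<List<String>> frequent = new HashSet<>(lk);

                  for (int i = 0; i < lk.size(); i++)
                      for (int j = i + 1; j < lk.size(); j++) {
                          List<String> a = lk.get(i), b = lk.get(j);
                          // Join: merge two 4-itemsets that agree on
                          // their first 3 items.
                          if (!a.subList(0, 3).equals(b.subList(0, 3)))
                              continue;
                          List<String> cand = new ArrayList<>(a);
                          cand.add(b.get(3));
                          // Subset check: every 4-subset of the candidate
                          // must itself be in L(4), else prune.
                          boolean kept = true;
                          for (int d = 0; d < cand.size() && kept; d++) {
                              List<String> sub = new ArrayList<>(cand);
                              sub.remove(d);
                              kept = frequent.contains(sub);
                          }
                          System.out.println(cand
                              + (kept ? "  kept" : "  pruned"));
                      }
              }
          }

        Running it produces exactly one surviving 5-itemset, matching L(5) above; the other join result fails the subset check.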

      5. [10 points] Consider the following frequent 4-itemset:
        milk=1 backbone=1 breathes=1 tail=1 
        
        Use Algorithms 6.2 and 6.3 (pp. 351-352), which are based on Theorem 6.2, to construct all rules with Confidence = 100% from this 4-itemset. Show your work by neatly constructing a lattice similar to the one depicted in Figure 6.15 (but you don't need to expand/include pruned rules).
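
        As a cross-check for this question, here is a minimal Java sketch of the first level of Algorithm 6.2 (single-item consequents) applied to this 4-itemset. The 3-subset support counts below are hypothetical placeholders, since that part of Weka's itemset output is elided above; substitute the real counts before drawing conclusions.

          import java.util.*;

          public class ConfRules {
              public static void main(String[] args) {
                  List<String> f = Arrays.asList("milk=1", "backbone=1",
                                                 "breathes=1", "tail=1");
                  int suppF = 35; // support count from L(4) above

                  // Hypothetical support counts of the 3-subsets.
                  Map<Set<String>, Integer> supp = new HashMap<>();
                  supp.put(new HashSet<>(Arrays.asList("backbone=1",
                      "breathes=1", "tail=1")), 40);     // hypothetical
                  supp.put(new HashSet<>(Arrays.asList("milk=1",
                      "breathes=1", "tail=1")), 35);     // hypothetical
                  supp.put(new HashSet<>(Arrays.asList("milk=1",
                      "backbone=1", "tail=1")), 35);     // hypothetical
                  supp.put(new HashSet<>(Arrays.asList("milk=1",
                      "backbone=1", "breathes=1")), 41); // hypothetical

                  // Level 1 of Algorithm 6.2: move one item at a time into
                  // the consequent; conf = supp(f) / supp(antecedent).
                  for (String item : f) {
                      Set<String> ant = new HashSet<>(f);
                      ant.remove(item);
                      double conf = (double) suppF / supp.get(ant);
                      System.out.printf("%s ==> [%s]  conf = %.2f%n",
                          ant, item, conf);
                      // Theorem 6.2: if conf < minconf here, every rule
                      // whose consequent contains `item` is pruned too.
                  }
                  // Algorithm 6.3 would now merge the surviving 1-item
                  // consequents into 2-item consequents and iterate.
              }
          }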

    3. [5 points] Explain how the process of mining association rules in Weka's Apriori is performed in terms of the following parameters: lowerBoundMinSupport, upperBoundMinSupport, delta, metricType, minMetric, numRules. (A toy sketch of this iterative loop appears at the end of Part I below.)

    4. [10 points] Exercise 16, p. 411 of the textbook.
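
    Regarding step 3 above: Weka's documentation describes an iterative search that starts near upperBoundMinSupport and repeatedly lowers the minimum support by delta until numRules rules score at least minMetric on the chosen metricType, or lowerBoundMinSupport is reached. The toy Java sketch below illustrates that loop; countRules is a hypothetical stand-in for a full Apriori pass with a fake rule count, and the exact stepping should be confirmed in the Weka source.

      public class DeltaLoop {
          // Hypothetical stand-in for a full Apriori pass at the given
          // minimum support: pretend the rule count grows as the support
          // threshold is relaxed.
          static int countRules(double minSupport) {
              return (int) Math.round(60 * (1.0 - minSupport));
          }

          public static void main(String[] args) {
              double upper = 1.0, lower = 0.1, delta = 0.05; // Weka defaults
              int numRules = 30;
              double minSupport = upper - delta; // first pass below upper
              while (countRules(minSupport) < numRules
                      && minSupport - delta >= lower) {
                  minSupport -= delta; // relax support and mine again
              }
              System.out.printf("stopped at minSupport = %.2f with %d rules%n",
                  minSupport, countRules(minSupport));
          }
      }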

  • Part II. GROUP PROJECT ASSIGNMENT