WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2010 
Homework and Project 3: Classification Rules and Instance Based Learning

PROF. CAROLINA RUIZ 

DUE DATES: Tuesday, Nov. 23, 8:00 am (electronic submission) and 11:00 am (hardcopy submission) 
------------------------------------------


HOMEWORK AND PROJECT OBJECTIVES

The purpose of this project is multi-fold: to gain experience constructing classification rules by hand with the sequential covering algorithm, to study how the RIPPER algorithm prunes rules using a validation set, and to implement and experiment with variations of the k-nearest neighbors algorithm.

HOMEWORK AND PROJECT ASSIGNMENTS

Readings: Read in great detail Sections 5.1 and 5.2 of your textbook.

This project consists of two parts:

  • Part I. INDIVIDUAL HOMEWORK ASSIGNMENT

    See Solutions to this homework assignment by Yutao Wang.

    1. Consider the following dataset, adapted from the Shuttle Landing Control Data Set available at the University of California Irvine (UCI) Machine Learning Data Repository. Visit that repository to learn more about this dataset.
      @relation shuttle-landing-control
      
      @attribute STABILITY   numeric
      @attribute ERROR       numeric
      @attribute WIND        {head, tail}
      @attribute VISIBILITY  {yes, no}
      @attribute Class       {noauto, auto}
      
      @data
      ( 1) 60, 0.5, tail, no,  auto
      ( 2) 75, 1.0, head, yes, noauto
      ( 3) 40, 0.9, head, no,  auto
      ( 4) 65, 0.0, head, no,  auto
      ( 5) 45, 0.2, head, yes, auto
      ( 6) 80, 0.1, tail, yes, noauto
      ( 7) 30, 0.4, head, yes, noauto
      ( 8) 90, 0.6, head, no,  auto
      ( 9) 65, 0.1, head, no,  auto
      (10) 85, 0.5, head, yes, noauto
      (11) 25, 0.6, tail, yes, auto
      (12) 40, 0.4, tail, yes, noauto
      (13) 15, 0.6, tail, yes, noauto
      (14) 25, 0.8, head, yes, noauto
      (15) 30, 0.2, head, yes, auto
      (16) 35, 0.4, head, yes, noauto
      (17) 70, 0.6, tail, no,  auto
      (18) 20, 0.5, tail, yes, auto
      (19) 75, 0.1, tail, no,  auto
      (20) 80, 0.2, head, yes, noauto
      (21) 85, 0.8, tail, yes, noauto
      (22) 60, 0.9, tail, yes, noauto
      

      (50 points) Classification Rules

      [See solutions to a similar problem from a previous offering of this course.]

      In this part, you will construct classification rules using the sequential covering algorithm (called Prism in Weka). Note that the dataset contains continuous attributes. Handle those continuous attributes as J4.8 would handle them, that is, using binary splits. To reduce the amount of work, consider only the following split points:

         split point for STABILITY: 50
         split points for ERROR:    0.3 and 0.7
      
      regardless of what values for those attributes are present in the (subset of the) dataset under consideration. That is, the only predicates that can appear in the rules are: STABILITY ≤ 50, STABILITY > 50, ERROR ≤ 0.3, ERROR > 0.3, ERROR ≤ 0.7, ERROR > 0.7.
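
      For concreteness, the following sketch (in Python; a minimal illustration, not part of the required solution, and the helper names are my own) enumerates these candidate predicates, together with the value tests for the nominal attributes, and computes their p/t ratios over a given set of instances. Note that during rule construction the ratios must be recomputed over only the instances still under consideration at each stage, so pass the appropriate subset rather than the full dataset:

        # Training data: (STABILITY, ERROR, WIND, VISIBILITY, Class)
        data = [
            (60, 0.5, "tail", "no",  "auto"),    # ( 1)
            (75, 1.0, "head", "yes", "noauto"),  # ( 2)
            (40, 0.9, "head", "no",  "auto"),    # ( 3)
            (65, 0.0, "head", "no",  "auto"),    # ( 4)
            (45, 0.2, "head", "yes", "auto"),    # ( 5)
            (80, 0.1, "tail", "yes", "noauto"),  # ( 6)
            (30, 0.4, "head", "yes", "noauto"),  # ( 7)
            (90, 0.6, "head", "no",  "auto"),    # ( 8)
            (65, 0.1, "head", "no",  "auto"),    # ( 9)
            (85, 0.5, "head", "yes", "noauto"),  # (10)
            (25, 0.6, "tail", "yes", "auto"),    # (11)
            (40, 0.4, "tail", "yes", "noauto"),  # (12)
            (15, 0.6, "tail", "yes", "noauto"),  # (13)
            (25, 0.8, "head", "yes", "noauto"),  # (14)
            (30, 0.2, "head", "yes", "auto"),    # (15)
            (35, 0.4, "head", "yes", "noauto"),  # (16)
            (70, 0.6, "tail", "no",  "auto"),    # (17)
            (20, 0.5, "tail", "yes", "auto"),    # (18)
            (75, 0.1, "tail", "no",  "auto"),    # (19)
            (80, 0.2, "head", "yes", "noauto"),  # (20)
            (85, 0.8, "tail", "yes", "noauto"),  # (21)
            (60, 0.9, "tail", "yes", "noauto"),  # (22)
        ]

        # The only allowed predicates: the fixed binary splits for the
        # numeric attributes plus every value test for the nominal ones.
        candidates = [
            ("STABILITY <= 50",  lambda r: r[0] <= 50),
            ("STABILITY > 50",   lambda r: r[0] > 50),
            ("ERROR <= 0.3",     lambda r: r[1] <= 0.3),
            ("ERROR > 0.3",      lambda r: r[1] > 0.3),
            ("ERROR <= 0.7",     lambda r: r[1] <= 0.7),
            ("ERROR > 0.7",      lambda r: r[1] > 0.7),
            ("WIND = head",      lambda r: r[2] == "head"),
            ("WIND = tail",      lambda r: r[2] == "tail"),
            ("VISIBILITY = yes", lambda r: r[3] == "yes"),
            ("VISIBILITY = no",  lambda r: r[3] == "no"),
        ]

        def p_over_t(test, rows, target="noauto"):
            """p = covered instances of the target class, t = all covered instances."""
            covered = [r for r in rows if test(r)]
            p = sum(1 for r in covered if r[4] == target)
            return p, len(covered)

        for name, test in candidates:     # pass the remaining instances
            p, t = p_over_t(test, data)   # instead of `data` at later stages
            print(f"{name:18}  p/t = {p}/{t}" + (f" = {p / t:.2f}" if t else ""))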

      1. Assume that the algorithm has produced the following two rules for Class=noauto so far:
        If ERROR > 0.7
           and VISIBILITY = yes then noauto
        If VISIBILITY = yes
           and STABILITY > 50 then noauto
        
        Starting from here, follow the sequential covering algorithm to construct "by hand" the 3rd rule for Class=noauto.

        1. (5 points) List the instances that are under consideration during the construction of this 3rd rule for Class=noauto.

        2. (10 points) Now, construct the 3rd rule using those instances. Use the ratio p/t to rank the attribute-value pairs that are candidates for inclusion in the rule, where t is the total number of (remaining) instances covered by the candidate rule and p is the number of those instances that belong to the target class. Your written solutions should show all your work. That is, at each stage during the construction of this rule, list all the candidate attribute-value pairs together with their p/t ratios, state which one was selected, and explain why.

      2. Assume now that the algorithm has produced ALL the rules for Class=noauto. (You only have to construct the 3rd one, as described above, not all of them.) Starting from this point, follow the sequential covering algorithm to construct "by hand" the 1st rule for Class=auto. Follow the same steps as before:

        1. (5 points) List the instances that are under consideration during the construction of this 1st rule for Class=auto.

        2. (10 points) Now, construct this 1st rule using those instances. At each stage during the construction of this rule, list all the attribute-value pairs (together with their p/t ratios) that are candidates for inclusion in the rule, which one was selected, and why.

      3. (20 points) Rule Pruning. In this part, you will investigate how the RIPPER algorithm prunes a rule using a validation set. See Section 5.1, pp. 220-221, of your textbook, and the JRip method in Weka, under Classification Rules (click "More" to see an algorithmic description of the method, and also read the Weka code implementing it).

        Given the rule

        If VISIBILITY = yes
           and ERROR ≤ 0.7 
           and ERROR >  0.3 
           and STABILITY ≤ 50
           and WIND = tail 
        then Class=noauto
        
        and the validation set:
              STABILITY  ERROR   WIND   VISIBILITY  Class
        (v1 ) 35,        0.1,    head,  no,         auto
        (v2 ) 80,        0.6,    tail,  yes,        noauto
        (v3 ) 35,        0.1,    head,  no,         auto
        (v4 ) 10,        0.6,    tail,  yes,        noauto
        (v5 ) 40,        0.5,    tail,  yes,        auto
        (v6 ) 80,        0.6,    tail,  yes,        noauto
        (v7 ) 25,        0.4,    tail,  yes,        auto
        (v8 ) 80,        0.6,    tail,  yes,        auto
        (v9 ) 20,        0.6,    tail,  yes,        noauto
        (v10) 35,        0.1,    head,  no,         auto
        (v11) 40,        0.5,    head,  yes,        noauto
        (v12) 15,        0.4,    head,  yes,        noauto
        
        show each step of the pruning method used by RIPPER on this rule over the above validation set. Show your work.
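
        As a point of reference, here is a minimal sketch of this pruning process, assuming the textbook's formulation: the rule is scored on the validation set by the metric (p - n)/(p + n), where p and n are the numbers of positive and negative validation instances the rule covers, and conjuncts are considered for removal starting from the last one added. The helper names are my own, and the tie-breaking behavior may differ from JRip's actual implementation:

          # Validation set: (STABILITY, ERROR, WIND, VISIBILITY, Class)
          validation = [
              (35, 0.1, "head", "no",  "auto"),    # (v1)
              (80, 0.6, "tail", "yes", "noauto"),  # (v2)
              (35, 0.1, "head", "no",  "auto"),    # (v3)
              (10, 0.6, "tail", "yes", "noauto"),  # (v4)
              (40, 0.5, "tail", "yes", "auto"),    # (v5)
              (80, 0.6, "tail", "yes", "noauto"),  # (v6)
              (25, 0.4, "tail", "yes", "auto"),    # (v7)
              (80, 0.6, "tail", "yes", "auto"),    # (v8)
              (20, 0.6, "tail", "yes", "noauto"),  # (v9)
              (35, 0.1, "head", "no",  "auto"),    # (v10)
              (40, 0.5, "head", "yes", "noauto"),  # (v11)
              (15, 0.4, "head", "yes", "noauto"),  # (v12)
          ]

          # The rule's conjuncts, in the order in which they were added.
          conjuncts = [
              ("VISIBILITY = yes", lambda r: r[3] == "yes"),
              ("ERROR <= 0.7",     lambda r: r[1] <= 0.7),
              ("ERROR > 0.3",      lambda r: r[1] > 0.3),
              ("STABILITY <= 50",  lambda r: r[0] <= 50),
              ("WIND = tail",      lambda r: r[2] == "tail"),
          ]

          def score(rule, rows, target="noauto"):
              """Validation metric (p - n)/(p + n) for the conjunction `rule`."""
              covered = [r for r in rows if all(test(r) for _, test in rule)]
              p = sum(1 for r in covered if r[4] == target)
              n = len(covered) - p
              return (p - n) / (p + n) if covered else float("-inf")

          rule = list(conjuncts)
          while len(rule) > 1:
              kept = score(rule, validation)
              dropped = score(rule[:-1], validation)
              print(f"{kept:+.3f} with '{rule[-1][0]}', {dropped:+.3f} without")
              if dropped > kept:       # prune only on strict improvement here;
                  rule = rule[:-1]     # tie handling may differ in JRip itself
              else:
                  break
          print("pruned rule: IF", " AND ".join(name for name, _ in rule), "THEN noauto")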

      (50 points) Instance-Based Learning

      [See solutions to a similar problem from a previous offering of this course.]

      Assume that we want to predict the Class attribute (prediction target) of the following two new data instances:
           STABILITY  ERROR   WIND   VISIBILITY
      (23) 35,        0.1,    head,  no  
      (24) 80,        0.6,    tail,  yes
      
      using the k-nearest neighbors algorithm on the same training set of 22 instances above.

      For each of the variations of the k-nearest neighbors algorithm listed below, do the following:

      • (5 points per variation) Implement a program/script that uses the described distance (similarity) metric to find the 4 nearest neighbors of each of the two instances above. Make the output of your code as verbose as possible so that we can see the work it does, and include both the output and the code in your report. When computing the Euclidean distance between two instances, a nominal attribute contributes 1 if the two values differ and 0 if they are the same. (A minimal sketch of one possible implementation appears after the list of variations below.)
      • (5 points per variation) Use those 4 nearest neighbors to classify each of the two given instances (23) and (24) using the "voting" methods described below. Show your work in your report.

      Variations:

        Variation 1:
        • Data Attributes: Use the data attributes as provided.
        • Distance metric: (Plain) Euclidean distance.
        • "Voting" method: Majority vote with no distance weighting.

        Variation 2:
        • Data Attributes: Use the data attributes as provided.
        • Distance metric: (Plain) Euclidean distance.
        • "Voting" method: Majority vote using distance weighting: use the 4 nearest neighbors weighted by the inverse of the distance. That is, if their respective distances to the test instance are d1, d2, d3, and d4, then the weights of the 4 nearest neighbors are w1 = 1/d1, w2 = 1/d2, w3 = 1/d3, and w4 = 1/d4 for the weighted majority vote.

        Variation 3:
        • Data Attributes: Preprocess STABILITY so that it ranges from 0 to 1. That is, replace each STABILITY value with (value - min. value)/(max. value - min. value), where min. value and max. value are the minimum and maximum values of that attribute respectively. Apply the same transformation to the STABILITY values of the test instances.
        • Distance metric: (Plain) Euclidean distance.
        • "Voting" method: Majority vote with no distance weighting.

        Variation 4:
        • Data Attributes: Preprocess STABILITY so that it ranges from 0 to 1, as described in Variation 3.
        • Distance metric: (Plain) Euclidean distance.
        • "Voting" method: Majority vote with distance weighting, as described in Variation 2.

        Variation 5:
        • Data Attributes: Preprocess STABILITY so that it ranges from 0 to 1, as described in Variation 3.
        • Distance metric: Weighted Euclidean distance. Use the following weights for each attribute:
          STABILITY    2
          ERROR        1
          WIND         4.5
          VISIBILITY   3

        • "Voting" method: Majority vote with distance weighting, as described in Variation 2.

  • Part II. GROUP PROJECT ASSIGNMENT