CS 4445 B Term 2010

Computer Science Department

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2010
Homework and Project 5: Clustering and Anomaly Detection

PROF. CAROLINA RUIZ

DUE DATES: Tuesday, Dec. 14, 9:00 am (electronic submission) and 11:00 am (hardcopy submission)

HOMEWORK AND PROJECT OBJECTIVES

The purpose of this project is twofold:

- To gain experience with clustering and anomaly detection.
- To gain additional experience with data mining techniques (and their combinations) from previous projects.

HOMEWORK AND PROJECT ASSIGNMENTS

Readings: Read in great detail Sections 8.1-8.5 and Chapter 10 of your textbook.

This project consists of two parts:

Part I. INDIVIDUAL HOMEWORK ASSIGNMENT

Consider the KDDTest-21.txt from the NSL-KDD Data Set webpage.
- [10 points] Use any anomaly detection approach(es) you wish to identify 3 outliers in this dataset. Include these outliers in your report, and explain in detail how you found them and why they are outliers.
- [30 points] As you know, the above test dataset contains attack-types that are not present in the training set KDDTrain+.TXT. The attack-types present in the test set but not in the training set that you need to consider are:
```
apache2
httptunnel
mailbomb
mscan
processtable
saint
snmpgetattack
snmpguess
```
  [There are other attack-types that appear in the test set but not in the training set; they are disregarded here because they are very infrequent.] Use clustering algorithms (e.g., Simple K-means and/or Hierarchical Clustering, as implemented in the Weka system; make sure to experiment with different "linkType"s) to determine whether any of the above attack-types are similar to other attack-types that appear in both the training and the test datasets, which are listed below:
```
back
buffer_overflow
ftp_write
guess_passwd
imap
ipsweep
land
loadmodule
multihop
neptune
nmap
normal
perl
phf
pod
portsweep
rootkit
satan
smurf
spy
teardrop
warezclient
warezmaster
```
  For this, you can use just the test set alone, or the test set and the training set combined. Explain your work in detail.
- [10 points] Exercise 16, p. 563 of the textbook. Show your work.
- [10 points] Exercise 17, p. 563 of the textbook. Show your work.
- [10 points] Exercise 32, p. 567 of the textbook. Explain your answer in detail.
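
One way to read the clustering output for the attack-type question above is to cross-tabulate cluster assignments against attack-type labels, and then check which known attack-types share a cluster with each new one. The following Python sketch illustrates that idea outside of Weka. The function names `cluster_composition` and `known_neighbors` are hypothetical, and the toy assignments are invented for illustration only; they do not reflect actual results on the NSL-KDD data. In practice the assignments would come from the clusterer you ran (e.g., Simple K-means or HierarchicalClusterer).

```python
from collections import Counter, defaultdict


def cluster_composition(assignments, labels):
    """Count how many instances of each attack-type fall in each cluster."""
    table = defaultdict(Counter)
    for cluster, label in zip(assignments, labels):
        table[cluster][label] += 1
    return table


def known_neighbors(table, new_type, known_types):
    """Find the cluster holding most instances of new_type, then list the
    known attack-types in that cluster, most frequent first."""
    home = max(table, key=lambda c: table[c][new_type])
    ranked = [(t, n) for t, n in table[home].most_common() if t in known_types]
    return home, ranked


# Toy data (invented): one cluster id and one attack-type label per instance.
assignments = [0, 0, 0, 1, 1, 1, 1, 1]
labels = ["neptune", "apache2", "apache2",
          "satan", "satan", "saint", "saint", "normal"]
known = {"neptune", "satan", "normal"}

table = cluster_composition(assignments, labels)
for new_type in ("apache2", "saint"):
    home, ranked = known_neighbors(table, new_type, known)
    print(new_type, "-> cluster", home, "shared with", ranked)
```

A new attack-type that is spread evenly over several clusters has no clear "home" cluster, so it is worth inspecting the full table rather than only the majority cluster before drawing conclusions.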

Part II. GROUP PROJECT ASSIGNMENT

Project Instructions: THOROUGHLY READ AND FOLLOW THE PROJECT GUIDELINES. These guidelines contain detailed information about how to structure your project, and how to prepare your written and oral reports.
Data Mining Technique(s): Run experiments using any of the following techniques, alone or in combination:
- Pre-processing Techniques:
  - Feature selection, feature creation, dimensionality reduction, noise reduction, attribute discretization, etc.
- Data Mining Techniques:
  - Clustering (new for this project)
    - Simple K-means
    - HierarchicalClusterer: Make sure to experiment with different "linkType"s
  - Anomaly Detection (new for this project)
  - Association Rules
    - (Non-Classification) Association Rules
    - Classification Association Rules
  - Classification Rules
    - Prism
    - JRip - Experiment with pruning.
    - PART
    - Decision Table
    - One-R
    - Zero-R
  - Instance-based Learning:
    - K-nearest neighbors
    - Locally Weighted Learning (LWL) - Experiment using it with different classification methods.
    - KStar
  - Decision Trees:
    - ID3, and
    - J4.8.
- Advanced Techniques:
  - You should consider using advanced techniques to improve the accuracy of your predictions. For instance, try ensemble methods (see Section 5.6 of your textbook), ways to deal with imbalanced classification targets (see Section 5.7 of your textbook), cost-sensitive classification, etc. But, in terms of data mining techniques, this project is restricted to the techniques listed above.
  - Any other creative ideas you have to boost model performance and/or to combine different models into a more powerful one.
Dataset and Challenge: In this challenge, we'll use the NSL-KDD Data Set based on the paper: A Detailed Analysis of the KDD CUP 99 Data Set by Mahbod Tavallaee, Ebrahim Bagheri, Wei Lu, and Ali A. Ghorbani.
The objective of this project is to construct a model with the highest prediction accuracy possible for the Boolean challenge, and a model with the highest prediction accuracy possible for the multivalued challenge described below. Please submit each of these two models by email to us as part of your project submission. The more accurate, creative, and well-designed your solution is, the better. Remember to include as much domain knowledge as you can.
- For the Boolean challenge, train on KDDTrain+.ARFF and test on KDDTest-21.ARFF. Remember to eliminate the "difficulty" attributes in the two datasets.
- For the multivalued challenge, use the datasets provided below. I constructed those datasets from KDDTrain+.TXT and KDDTest-21.TXT by eliminating the difficulty attribute in both of them, and by removing from the test set those instances with an attack-type not in the training set:
  - KDDTrain+_no_difficulty.arff
  - KDDTest-21_only_training_attacks_no_difficulty.arff
Since you are testing on a separate test set, you do not need to use 10-fold cross-validation for this challenge.
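
The dataset preparation described above for the multivalued challenge (dropping the difficulty attribute, then removing test instances whose attack-type does not occur in the training set) can be sketched as follows. This Python sketch assumes a comma-separated layout in which the last field of each row is the difficulty score and the field before it is the attack-type; verify that assumption against the actual files before relying on it. The function names and the shortened toy rows are hypothetical and are not taken from the real data.

```python
import csv
import io


def read_rows(text):
    """Parse comma-separated text into a list of rows, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def strip_difficulty(rows):
    """Drop the last field of every row (assumed to be the difficulty score)."""
    return [row[:-1] for row in rows]


def keep_training_attacks(train_rows, test_rows):
    """Keep only test rows whose label (last field) also occurs in training."""
    known = {row[-1] for row in train_rows}
    return [row for row in test_rows if row[-1] in known]


# Toy rows (invented): three features, attack-type, difficulty.
TRAIN_TXT = "0,tcp,http,normal,20\n0,icmp,ecr_i,smurf,18\n"
TEST_TXT = ("0,tcp,http,normal,21\n"
            "0,udp,other,snmpguess,15\n"
            "0,icmp,ecr_i,smurf,19\n")

train = strip_difficulty(read_rows(TRAIN_TXT))
test = strip_difficulty(read_rows(TEST_TXT))
test_filtered = keep_training_attacks(train, test)

print(len(test), "test rows before filtering,", len(test_filtered), "after")
```

The difficulty field is stripped before filtering so that the label is the last field in both sets when the known attack-types are collected.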
