Worcester Polytechnic Institute (WPI)

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2010 
Homework and Project 5: Clustering and Anomaly Detection

PROF. CAROLINA RUIZ 

DUE DATES: Tuesday, Dec. 14, 9:00 am (electronic submission) and 11:00 am (hardcopy submission) 
------------------------------------------


HOMEWORK AND PROJECT OBJECTIVES

The purpose of this project is multi-fold:

HOMEWORK AND PROJECT ASSIGNMENTS

Readings: Read Sections 8.1-8.5 and Chapter 10 of your textbook in great detail.

This project consists of two parts:

  • Part I. INDIVIDUAL HOMEWORK ASSIGNMENT

    1. Consider the KDDTest-21.txt from the NSL-KDD Data Set webpage.

      • [10 points] Use any anomaly detection approach(es) you wish to identify 3 outliers in this dataset. Include these outliers in your report, and explain in detail how you found them and why they are outliers.

      • [30 points] As you know, the above test dataset contains attack-types that are not present in the training set KDDTrain+.TXT. Of these, the attack-types that you need to consider are:
        apache2
        httptunnel
        mailbomb
        mscan
        processtable
        saint
        snmpgetattack
        snmpguess
        
        [Other attack-types also appear in the test set but not in the training set; they are disregarded here because they are very infrequent.] Use clustering algorithms (e.g., Simple K-means and/or Hierarchical Clustering as implemented in the Weka system; for hierarchical clustering, make sure to experiment with different "linkType" settings) to determine whether any of the above attack-types are similar to attack-types that appear in both the training and the test datasets, which are listed below:
        back
        buffer_overflow
        ftp_write
        guess_passwd
        imap
        ipsweep
        land
        loadmodule
        multihop
        neptune
        nmap
        normal
        perl
        phf
        pod
        portsweep
        rootkit
        satan
        smurf
        spy
        teardrop
        warezclient
        warezmaster
        
        For this, you may use either the test set alone or the test set and the training set combined. Explain your work in detail.
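
One possible way to read a clustering result for the question above is to cross-tabulate cluster assignments against attack-type labels and check which known attack-types share a cluster with each new one. The Python sketch below illustrates that idea only. It assumes you have already obtained one (cluster id, attack-type) pair per instance from your clustering tool; the function names and the toy data are hypothetical and are not part of Weka or of the NSL-KDD files. "Similar" is defined here, as an assumption, by a shared dominant cluster; other definitions are equally reasonable.

```python
from collections import Counter, defaultdict

# Attack-types listed in the assignment as present only in the test set.
NEW_ATTACKS = {"apache2", "httptunnel", "mailbomb", "mscan",
               "processtable", "saint", "snmpgetattack", "snmpguess"}


def cluster_profile(assignments):
    """Map each label to a Counter of the cluster ids its instances fell into.

    `assignments` is an iterable of (cluster_id, label) pairs.
    """
    profile = defaultdict(Counter)
    for cluster_id, label in assignments:
        profile[label][cluster_id] += 1
    return profile


def dominant_cluster(counter):
    """Return (cluster_id, fraction of the label's instances in that cluster)."""
    cluster_id, count = counter.most_common(1)[0]
    return cluster_id, count / sum(counter.values())


def similar_known_attacks(assignments, new_attacks=NEW_ATTACKS):
    """For each new attack-type, list known labels with the same dominant cluster."""
    profile = cluster_profile(assignments)
    dominant = {label: dominant_cluster(c)[0] for label, c in profile.items()}
    result = {}
    for label in sorted(profile.keys() & new_attacks):
        result[label] = sorted(
            other for other, cid in dominant.items()
            if cid == dominant[label] and other not in new_attacks
        )
    return result


# Made-up toy assignments, purely illustrative; not taken from NSL-KDD.
toy = [(0, "neptune"), (0, "neptune"), (0, "apache2"), (0, "apache2"),
       (1, "satan"), (1, "saint"), (1, "saint"), (2, "normal"),
       (2, "normal"), (1, "apache2")]
print(similar_known_attacks(toy))
```

A dominant cluster that holds only a small fraction of an attack-type's instances is weak evidence of similarity, so it is worth reporting the fraction returned by `dominant_cluster` alongside any match, and repeating the comparison for different numbers of clusters and link types.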

    2. [10 points] Exercise 16, p. 563 of the textbook. Show your work.

    3. [10 points] Exercise 17, p. 563 of the textbook. Show your work.

    4. [10 points] Exercise 32, p. 567 of the textbook. Explain your answer in detail.

  • Part II. GROUP PROJECT ASSIGNMENT