Part I. INDIVIDUAL HOMEWORK ASSIGNMENT
See Solutions by Hao Wan.
Consider the following dataset, adapted from the
Shuttle Landing Control Data Set available at the
University of California, Irvine (UCI) Machine Learning Repository.
Visit the webpages above to learn more about this dataset.
@relation shuttle-landing-control
@attribute STABILITY {stab, xstab}
@attribute ERROR {LX, MM, SS}
@attribute WIND {head, tail}
@attribute VISIBILITY {yes, no}
@attribute Class {noauto, auto}
@data
( 1) xstab, MM, tail, no, auto
( 2) xstab, LX, head, yes, noauto
( 3) stab, LX, head, no, auto
( 4) xstab, SS, head, no, auto
( 5) stab, SS, head, yes, auto
( 6) xstab, SS, tail, yes, noauto
( 7) stab, MM, head, yes, noauto
( 8) xstab, MM, head, no, auto
( 9) xstab, SS, head, no, auto
(10) xstab, MM, head, yes, noauto
(11) stab, MM, tail, yes, auto
(12) stab, MM, tail, yes, noauto
(13) stab, MM, tail, yes, noauto
(14) stab, LX, head, yes, noauto
(15) stab, SS, head, yes, auto
(16) stab, MM, head, yes, noauto
(17) xstab, MM, tail, no, auto
(18) stab, MM, tail, yes, auto
(19) xstab, SS, tail, no, auto
(20) xstab, SS, head, yes, noauto
(21) xstab, LX, tail, yes, noauto
(22) xstab, LX, tail, yes, noauto
- (20 points) Complete the construction of the full ID3 decision tree
shown below, using entropy to rank
the predictive attributes (STABILITY, ERROR, WIND, VISIBILITY)
with respect to the target/classification attribute (Class).
Note that the root of the decision tree (VISIBILITY) is given, so you
do NOT need to do the gain or entropy calculations needed to determine that this
attribute should be the root node. However, you must
show all the steps of your calculations from there on.
Make sure you compute logarithms in base b (for the appropriate b) correctly, as
some calculators don't provide a log_b primitive for every b; the relevant
formulas are recalled after the tree sketch below.
Also, state explicitly in your tree which instances belong to each tree
node, using the line numbers provided next to each data instance
in the dataset above.
VISIBILITY
/ \
yes / \ no
/ \
? ?
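For reference, these are the standard ID3 formulas (a reminder, not part of the graded answer):

  Entropy(S) = - \sum_{i=1}^{c} p_i \log_2 p_i
  Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} (|S_v| / |S|) \cdot Entropy(S_v)
  \log_b x = \ln x / \ln b = \log_{10} x / \log_{10} b   (change of base)

For example, a node containing 5 "auto" and 3 "noauto" instances has entropy
-(5/8) \log_2 (5/8) - (3/8) \log_2 (3/8) \approx 0.954.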
- (5 points)
Propose approaches to using your decision tree above to classify instances
that contain missing values. Use the following instance to illustrate your
ideas.
VISIBILITY = yes, STABILITY = stab, ERROR = ?, WIND = head
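For illustration only (proposing and justifying approaches is your task): one standard
strategy, used by C4.5, is to send such an instance down every branch of the node that
tests the missing attribute, weighting each branch by the fraction of training instances
that followed it, and then combining the class distributions of the resulting leaves:

  P(Class = c | x) = \sum_{v \in Values(ERROR)} P(ERROR = v) \cdot P(Class = c | x, ERROR = v)

where P(ERROR = v) is estimated from the training instances that reached the node testing
ERROR. A simpler alternative is to impute the most common ERROR value among those
instances before classifying.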
- Study how J4.8 performs post-pruning by reading in detail:
Part II. GROUP PROJECT ASSIGNMENT
- Project Instructions:
THOROUGHLY READ AND FOLLOW THE
PROJECT GUIDELINES.
These guidelines contain detailed information about how to structure your
project and how to prepare your written and oral reports.
- Data Mining Technique(s):
We will run experiments using the following techniques:
- Pre-processing Techniques:
- Feature selection, feature creation, dimensionality reduction,
noise reduction, attribute discretization, ...
- Data Mining Techniques:
- Zero-R
- One-R
- Decision Trees:
- ID3, and
- J4.8. Given that J4.8 can handle numeric attributes and
missing values directly, make sure to run
some experiments with no pre-processing
and
some experiments with pre-processing, and compare your results.
Experiment also with pre- and post-pruning of the J4.8 decision tree
to see whether pruning increases the classification accuracy
(a sketch of how these experiments might be scripted appears after this list).
- Advanced Techniques:
- You can consider using advanced techniques to improve the accuracy
of your predictions. For instance, you can try
ensemble methods (see Section 5.6 of your textbook),
ways to deal with imbalanced classification targets
(see Section 5.7 of your textbook), etc.
But, in terms of data mining techniques, this project is restricted to Zero-R,
One-R, and decision trees.
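To make the J4.8 pruning experiments concrete, here is a minimal sketch (not a
prescribed setup) using the Weka Java API. The file name train.arff, the 10-fold
cross-validation, and the parameter values are illustrative assumptions:

  import java.util.Random;
  import weka.classifiers.Evaluation;
  import weka.classifiers.trees.J48;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class PruningExperiment {
      public static void main(String[] args) throws Exception {
          // "train.arff" is a placeholder for your (possibly pre-processed) dataset.
          Instances data = DataSource.read("train.arff");
          data.setClassIndex(data.numAttributes() - 1); // class is the last attribute

          J48 pruned = new J48();            // Weka's default: pruned tree
          pruned.setConfidenceFactor(0.25f); // lower values prune more aggressively

          J48 unpruned = new J48();
          unpruned.setUnpruned(true);        // switch pruning off for comparison

          for (J48 tree : new J48[] { pruned, unpruned }) {
              Evaluation eval = new Evaluation(data);
              eval.crossValidateModel(tree, data, 10, new Random(1));
              System.out.printf("%s: %.2f%% correct%n",
                      tree.getUnpruned() ? "unpruned" : "pruned",
                      eval.pctCorrect());
          }
          // Zero-R and One-R baselines work the same way, via
          // weka.classifiers.rules.ZeroR and weka.classifiers.rules.OneR.
      }
  }

Pre-pruning can be explored by raising the minimum number of instances per leaf
(setMinNumObj, the -M command-line option), while the confidence factor (-C)
controls post-pruning.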
- Dataset:
We will work with the
KDD Cup 1999 Data Set.
This dataset contains about 5 million data instances.
You should use as much data from this dataset as you can.
You can also focus on the 10% subset contained in
kddcup.data_10_percent.gz (or on as much of this 10% subset as
your computer memory allows; one way to subsample further is sketched
at the end of this section).
Note that the target classification attribute
(let's call it attack_type) is nominal. Its values appear in the
first line of
kddcup.names, and it is the last column of
kddcup.data_10_percent.gz.
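If even the 10% file exceeds your computer's memory, one possibility is to subsample
it further with Weka's Resample filter. The sketch below assumes the data has already
been uncompressed and converted to a Weka-readable format under the placeholder name
kddcup_10_percent.arff:

  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;
  import weka.filters.Filter;
  import weka.filters.unsupervised.instance.Resample;

  public class Subsample {
      public static void main(String[] args) throws Exception {
          // Placeholder file name; convert the raw CSV to ARFF first.
          Instances data = DataSource.read("kddcup_10_percent.arff");
          data.setClassIndex(data.numAttributes() - 1);

          Resample resample = new Resample();
          resample.setSampleSizePercent(10.0); // keep 10% of the instances
          resample.setRandomSeed(1);           // reproducible sample
          resample.setInputFormat(data);
          Instances sample = Filter.useFilter(data, resample);
          System.out.println(sample.numInstances() + " instances kept");
      }
  }

Note that this still loads the full file once; if that is not feasible, subsample the
raw text file line by line with a small script instead.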
- Challenge:
The objective of this project is to construct a model with the highest
prediction accuracy possible for each of the following two challenges:
- Main Challenge (invest most of your effort on this):
Predicting whether there is an attack or not; that is, treating
the classification target as boolean (attack vs. no attack).
One way to derive this boolean target is sketched after this list.
- Secondary Challenge (invest less effort on this):
Predicting the type of attack (including "normal").
That is, using all the given values for the classification attribute:
back,buffer_overflow,ftp_write,guess_passwd,imap,ipsweep,land,loadmodule,
multihop,neptune,nmap,normal,perl,phf,pod,portsweep,rootkit,satan,smurf,spy,
teardrop,warezclient,warezmaster.
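For the main challenge, one possible way to derive the boolean target is to rewrite
the last column of each data line, mapping "normal" to itself and every other label
to "attack". A minimal sketch follows; the file names are placeholders, and in the
raw KDD Cup files each label appears to be followed by a period, which is stripped here:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.FileWriter;
  import java.io.PrintWriter;

  public class BinarizeTarget {
      public static void main(String[] args) throws Exception {
          // Placeholder file names; the input is the uncompressed CSV data.
          try (BufferedReader in = new BufferedReader(new FileReader("kddcup.data_10_percent"));
               PrintWriter out = new PrintWriter(new FileWriter("kddcup_binary.csv"))) {
              String line;
              while ((line = in.readLine()) != null) {
                  if (line.isEmpty()) continue;
                  int comma = line.lastIndexOf(',');
                  String label = line.substring(comma + 1).trim();
                  // Strip the trailing period that follows each label in the raw files.
                  if (label.endsWith(".")) label = label.substring(0, label.length() - 1);
                  String target = label.equals("normal") ? "normal" : "attack";
                  out.println(line.substring(0, comma + 1) + target);
              }
          }
      }
  }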