This project consists of two parts:
See solutions by Chiying Wang.
Consider the following dataset.
    @relation simple-weather

    @attribute outlook {sunny,overcast,rainy}
    @attribute humidity numeric
    @attribute windy {TRUE,FALSE}
    @attribute play {yes,no}

    @data
    sunny, 80, FALSE, no
    sunny, 90, TRUE, no
    overcast, 80, FALSE, yes
    rainy, 96, FALSE, yes
    rainy, 80, FALSE, yes
    rainy, 72, TRUE, no
    overcast, 72, TRUE, yes
    sunny, 96, FALSE, no
    sunny, 72, FALSE, yes
    rainy, 80, FALSE, yes
    sunny, 72, TRUE, yes
    overcast, 90, TRUE, yes
    overcast, 80, TRUE, yes
    rainy, 96, TRUE, no

where the play attribute is the classification target.
In particular, consider classifying the test instance outlook = ?, humidity = 80, and windy = FALSE.
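As a concrete illustration, the sketch below trains a tree on this dataset and classifies the instance programmatically through the Weka Java API. The file name simple-weather.arff and the choice of J4.8 are assumptions for illustration only; the exercise may instead expect you to trace the prediction by hand.

    import weka.classifiers.trees.J48;
    import weka.core.DenseInstance;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SimpleWeatherDemo {
        public static void main(String[] args) throws Exception {
            // Assumption: the ARFF above is saved as simple-weather.arff.
            Instances data = DataSource.read("simple-weather.arff");
            data.setClassIndex(data.numAttributes() - 1);  // play is the target

            J48 tree = new J48();  // J4.8 with default parameters
            tree.buildClassifier(data);

            // Test instance: outlook = ?, humidity = 80, windy = FALSE.
            Instance test = new DenseInstance(data.numAttributes());
            test.setDataset(data);
            test.setMissing(data.attribute("outlook"));      // outlook is unknown
            test.setValue(data.attribute("humidity"), 80);
            test.setValue(data.attribute("windy"), "FALSE");

            double p = tree.classifyInstance(test);
            System.out.println("predicted play = " + data.classAttribute().value((int) p));
        }
    }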
Begin with the dataset without any additional pre-processing (other than removing fnlwgt). Create a preliminary decision tree model using Weka's implementation of J4.8 with the default parameters. Use salary as the classification target.
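If you prefer to script this step rather than use the Explorer GUI, a minimal sketch follows. The file name adult.arff and the attribute names fnlwgt and salary match the standard UCI Adult dataset; adjust them to your copy.

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class AdultBaseline {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("adult.arff");  // assumed file name
            data.setClassIndex(data.numAttributes() - 1);    // salary is the class

            // Drop fnlwgt before modelling (Remove uses 1-based indices).
            Remove remove = new Remove();
            remove.setAttributeIndices(String.valueOf(data.attribute("fnlwgt").index() + 1));
            remove.setInputFormat(data);
            Instances filtered = Filter.useFilter(data, remove);

            J48 tree = new J48();  // default parameters
            tree.buildClassifier(filtered);
            System.out.println(tree);
        }
    }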
Now apply supervised discretization to all the numeric attributes. Create a new decision tree model using Weka's implementation of J4.8, again with salary as the target attribute. Examine your two models. Compare and contrast them. Use 10-fold cross-validation to perform an analysis of the classification accuracy. Answer the following questions in your description of this experiment:
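One way to script this evaluation, sketched below under the same adult.arff assumption (with fnlwgt already removed), wraps the supervised Discretize filter in a FilteredClassifier so the discretization cut points are re-learned on each training fold and never see the held-out fold. Applying the filter once in the Preprocess panel, as the assignment describes, is the simpler GUI alternative.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.supervised.attribute.Discretize;

    public class DiscretizedJ48 {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("adult.arff");  // assumed file name
            data.setClassIndex(data.numAttributes() - 1);

            // Supervised (entropy-based) discretization applied inside each fold.
            FilteredClassifier fc = new FilteredClassifier();
            fc.setFilter(new Discretize());
            fc.setClassifier(new J48());

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(fc, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }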
Use preprocessing and postprocessing techniques to generate a J4.8
decision tree that predicts salary as accurately as possible,
but with 40 or fewer leaf nodes. Include an image of the tree in your
report.
In searching for this model, you must experiment with:
Preprocessing: Experiment with and without each of the following:
(1) attribute discretization; (2) replacing missing values;
(3) attribute selection (Correlation-based Feature Selection);
(4) feature reduction: For this, replace the numeric attributes in the dataset with the components resulting from applying PCA to just these numeric attributes. Keep the nominal attributes intact. (A sketch of this split-transform-merge step appears after this list.)
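The PCA step in item (4) is the least direct to set up in Weka, because the PrincipalComponents filter transforms the whole dataset. One way to restrict it to the numeric attributes, sketched below, is to split the data into numeric-only and nominal-only views, apply PCA to the numeric view, and merge the results back; this split-and-merge approach is an assumption about how to satisfy the requirement, not a prescribed method, and the file and class names are assumed as before.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.PrincipalComponents;
    import weka.filters.unsupervised.attribute.RemoveType;

    public class NumericPca {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("adult.arff");  // assumed file name

            // Split into numeric-only and nominal-only views of the same rows.
            RemoveType dropNominal = new RemoveType();
            dropNominal.setOptions(new String[] {"-T", "nominal"});
            dropNominal.setInputFormat(data);
            Instances numericOnly = Filter.useFilter(data, dropNominal);

            RemoveType dropNumeric = new RemoveType();
            dropNumeric.setOptions(new String[] {"-T", "numeric"});
            dropNumeric.setInputFormat(data);
            Instances nominalOnly = Filter.useFilter(data, dropNumeric);

            // PCA over the numeric attributes only.
            PrincipalComponents pca = new PrincipalComponents();
            pca.setInputFormat(numericOnly);
            Instances components = Filter.useFilter(numericOnly, pca);

            // Recombine: original nominal attributes + principal components.
            Instances merged = Instances.mergeInstances(nominalOnly, components);
            merged.setClassIndex(merged.attribute("salary").index());  // class name assumed
            System.out.println(merged.toSummaryString());
        }
    }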
Parameter Values and Postprocessing:
Vary the values of the J4.8 parameters: binarySplit, confidenceFactor, minNumObj, reducedErrorPruning, subtreeRaising, and unpruned.
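All six parameters are exposed as setters on the J48 class, so a sweep is easy to script. The sketch below shows one arbitrary configuration, not a recommended one, and uses measureNumLeaves() to check the 40-or-fewer-leaves constraint; the file name is assumed as before.

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48ParameterSweep {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("adult.arff");  // assumed, fnlwgt removed
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();
            // One example configuration; the assignment asks you to vary all six.
            tree.setBinarySplits(true);
            tree.setConfidenceFactor(0.1f);      // stronger pruning than the 0.25 default
            tree.setMinNumObj(50);               // larger leaves -> fewer leaf nodes
            tree.setReducedErrorPruning(false);  // if true, confidenceFactor is ignored
            tree.setSubtreeRaising(true);
            tree.setUnpruned(false);
            tree.buildClassifier(data);

            // Check the 40-or-fewer-leaves constraint.
            System.out.println("leaves: " + tree.measureNumLeaves());
        }
    }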
Examine the model. Compare and contrast this model against a ZeroR model, a OneR model, and models generated in the Easy Level challenge above. Answer the following questions in your description of this experiment:
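For the baseline comparison, the sketch below runs 10-fold cross-validation over ZeroR, OneR, and a default J4.8 on the same data; a full comparison would also evaluate each preprocessed variant from the experiments above. File name assumed as before.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.OneR;
    import weka.classifiers.rules.ZeroR;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BaselineComparison {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("adult.arff");  // assumed file name
            data.setClassIndex(data.numAttributes() - 1);

            Classifier[] models = { new ZeroR(), new OneR(), new J48() };
            for (Classifier model : models) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(model, data, 10, new Random(1));
                System.out.printf("%-6s accuracy: %.2f%%%n",
                        model.getClass().getSimpleName(), eval.pctCorrect());
            }
        }
    }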