CS 525M Fall 2001

Computer Science Department

CS 525M KNOWLEDGE DISCOVERY AND DATA MINING
PROJECT 3 - Decision Trees. Fall 2001

PROF. CAROLINA RUIZ

DUE DATE: This project is due on Thursday Nov. 8, 2001 at 1 pm.

Project Description
Project Assignment
Report Submission and Due Date

PROJECT DESCRIPTION

After selecting one of the attributes in your dataset as the target attribute for classification, construct the most accurate decision tree you can for predicting that target attribute.

PROJECT ASSIGNMENT

The following are guidelines for the construction of your decision tree:

Code: You can use the decision tree methods implemented in the Weka system. I recommend using ID3 for your experiments.
Training and Testing Instances:
You may restrict your experiments to a subset of the dataset IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your system is, the better.
A main part of this project is the PREPROCESSING of your dataset. You must apply relevant concept hierarchies and generalizations to your dataset before doing the mining and/or using the results of previous mining tasks (e.g. project 2, initial minings of the data using ID3). Your report should contain a detailed description of the preprocessing of your dataset and justifications of the steps you followed. If Weka does not provide the functionality you need to preprocess your data so that you obtain useful patterns, preprocess the data yourself by writing the necessary filters (you can incorporate them in Weka if you wish).
Use n-fold cross-validation to test your results. I recommend using n=4, but you may use a different value IF needed given your dataset (you'll need to justify a different selection in your report).
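As an illustration of the two guidelines above, here is a minimal Python sketch of (a) generalizing one attribute with a concept hierarchy and (b) partitioning instances into n folds for cross-validation. It is not Weka code: the hierarchy CITY_TO_REGION, the attribute names, and both function names are hypothetical and chosen only for this example. Weka provides its own filters and evaluation options for these steps.

```python
import random

# Hypothetical concept hierarchy: specific values -> a more general concept.
CITY_TO_REGION = {
    "Boston": "Northeast",
    "Worcester": "Northeast",
    "Miami": "Southeast",
    "Atlanta": "Southeast",
}


def generalize(instances, attribute, hierarchy):
    """Return copies of the instances with one attribute moved one level
    up the given concept hierarchy. The input list is not modified."""
    result = []
    for inst in instances:
        new_inst = dict(inst)
        # Values that do not appear in the hierarchy are left unchanged.
        new_inst[attribute] = hierarchy.get(inst[attribute], inst[attribute])
        result.append(new_inst)
    return result


def cross_validation_folds(instances, n=4, seed=0):
    """Yield (training, testing) pairs for n-fold cross-validation.
    Each instance is used for testing in exactly one fold."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatable folds
    folds = [shuffled[i::n] for i in range(n)]
    for i in range(n):
        testing = folds[i]
        training = [inst for j, fold in enumerate(folds) if j != i for inst in fold]
        yield training, testing
```

In use, one would build a tree on each training set, measure its accuracy on the matching testing set, and report the average accuracy over the n folds.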

REPORT AND DUE DATE

Written Report. Your written report is due at 1 pm. Please leave a hardcopy of your report in my mailbox (CS Office, FL231) by the due date/time. Please note that my mailbox is the one BELOW the label marked with my last name RUIZ. In the EXCEPTIONAL case that you cannot go to the CS office, email your report to me by noon. Electronic submissions will be accepted only under EXCEPTIONAL circumstances.
Your report should contain the following sections with the corresponding discussions:
1. Code Description: Describe the decision tree code that you used from Weka. Explain the algorithm underlying the code in terms of the input it receives and the output it produces, and the steps it follows to produce this output.
2. Data: Describe the dataset that you selected in terms of the attributes present in the data, the number of instances, missing values, and other relevant characteristics.
  Provide a detailed description of the preprocessing of your data. Justify the preprocessing you apply and explain why the resulting data is appropriate for mining decision trees.
3. Experiments: For each experiment you ran describe:
  - Data: What data did you use to construct and test your decision tree?
  - Any additional pre- or post-processing done to improve the accuracy of your tree.
  - Accuracy of the resulting decision tree.
4. Summary of Results
  - What was the accuracy of the most accurate decision tree constructed by your system?
  - Include the most accurate tree you obtained in your report.
  - Discuss the strengths and the weaknesses of your system.
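As a reference point for the Code Description section, the attribute-selection step of ID3 as it is commonly described (pick the attribute with the highest information gain) can be sketched in Python as follows. This is a minimal illustration, not the Weka implementation: the function names and the dictionary-based instance format are assumptions made for this example, and it handles nominal attributes without missing values only.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(instances, attribute, target):
    """Reduction in the entropy of the target attribute obtained by
    splitting the instances on the given nominal attribute."""
    labels = [inst[target] for inst in instances]
    # Group the target labels by the value of the splitting attribute.
    partitions = {}
    for inst in instances:
        partitions.setdefault(inst[attribute], []).append(inst[target])
    # Weighted average entropy of the partitions after the split.
    remainder = sum(
        len(part) / len(instances) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


def best_attribute(instances, attributes, target):
    """Attribute with the highest information gain (the ID3 split choice)."""
    return max(attributes, key=lambda a: information_gain(instances, a, target))
```

A full tree builder would call best_attribute at each node, split the instances on the chosen attribute, and recurse until a node is pure or no attributes remain.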
Oral Report. We will discuss the results from the individual projects during the class on November 8th. Be ready to show your results and to discuss your project in class. PREPARE OVERHEAD TRANSPARENCIES SHOWING YOUR WORK.
