CS 539 Spring 2005

Computer Science Department

CS539 Machine Learning - Spring 2005
Project 2 - Decision Trees

PROF. CAROLINA RUIZ

Due Date: Thursday, Feb. 3rd 2005. Slides are due at 3:00 pm (by email) and the Written Report is due at 4:00 pm (beginning of class).

Project Description
Project Assignment
Report Submission and Due Date

PROJECT DESCRIPTION

Construct the best (i.e., most accurate and/or smallest and/or most readable) decision tree you can for predicting the class attribute for each of the following datasets:

The Automobile Database taken from the UCI Machine Learning Repository.
The Covertype data taken from the UCI Machine Learning Repository.

PROJECT ASSIGNMENT

Read Chapter 3 of the textbook about decision trees in great detail.
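As a quick reminder of the idea behind ID3, the sketch below computes entropy and information gain for nominal attributes, which is the criterion ID3 uses to choose the attribute to split on. It is a minimal illustration in plain Python, not Weka's code; the function names and the toy attributes ("fuel", "doors") and class values are hypothetical and chosen only for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one nominal attribute.

    rows is a list of dicts mapping attribute name -> nominal value.
    """
    # Group the class labels by the value the attribute takes in each row.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the partitions after the split.
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


# Hypothetical toy data, invented for illustration only.
rows = [
    {"fuel": "gas", "doors": "two"},
    {"fuel": "gas", "doors": "four"},
    {"fuel": "diesel", "doors": "two"},
    {"fuel": "diesel", "doors": "four"},
]
labels = ["risky", "risky", "safe", "safe"]

# ID3 would split on the attribute with the highest information gain.
best = max(["fuel", "doors"], key=lambda a: information_gain(rows, labels, a))
print(best)
```

In this toy data "fuel" separates the two classes perfectly while "doors" carries no information about the class, so "fuel" is selected.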
The following are guidelines for the construction of your decision tree:
- Code: You can use the decision tree methods implemented in the Weka system. Use ID3 and J4.8 for your experiments. Read the Weka code implementing ID3 and J4.8 in detail.
- Training and Testing Instances:
  You may restrict your experiments to a subset of the instances IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your decision tree is, the better.
- Preprocessing of the Data: A main part of this project is the PREPROCESSING of your dataset.
  - For both ID3 and J4.8: You should apply relevant filters to your dataset before doing the mining and/or using the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications of the steps you followed. If Weka does not provide the functionality you need to preprocess your data in order to obtain useful patterns, preprocess the data yourself by writing the necessary filters (you can incorporate them into Weka if you wish).
    To the extent possible, modify the attribute names and the value names so that the resulting decision trees are easier to read.
  - For J4.8: Read J4.8's code to determine how J4.8 handles numeric attributes, missing values, etc. if they are present in the dataset. Also compare the performance of J4.8 when you allow it to handle numeric attributes and missing values automatically vs. its performance when you pre-process the data to handle those cases.
- Evaluation and Testing: Experiment with different testing methods:
  1. Supply separate training and testing data to Weka.
  2. Supply training data to Weka and experiment with several split ratios.
  3. Use n-fold cross-validation to test your results. Experiment with different values for the number of folds.
- Pruning of your decision tree:
  Read Weka's ID3 and J4.8 code to determine what type of post-processing techniques they offer to increase the classification accuracy and/or to reduce the size of the decision tree. Describe that functionality in detail in your written report and experiment with this functionality. Alter Weka's code if you want to tailor it to your needs.
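
The testing methods listed under Evaluation and Testing differ only in how the instances are divided into training and testing sets. The sketch below shows one way a percentage split and an n-fold cross-validation could partition instance indices. It is a plain Python illustration under simple assumptions (random shuffling with a fixed seed, no stratification by class) and is not a description of how Weka performs these splits; the function names are hypothetical.

```python
import random


def percentage_split(n_instances, train_ratio, seed=0):
    """Return (train_indices, test_indices) for a single random split."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_instances * train_ratio))
    return indices[:cut], indices[cut:]


def k_fold_indices(n_instances, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Example: 10 instances, a 66% split and 5-fold cross-validation.
train, test = percentage_split(10, 0.66)
print(len(train), len(test))
for fold_train, fold_test in k_fold_indices(10, 5):
    print(sorted(fold_test))
```

Every instance is used for testing exactly once across the folds, which is why cross-validation estimates tend to be less sensitive to one lucky or unlucky split than a single split ratio.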

REPORT AND DUE DATE

Written Report.
Your report should contain the following sections with the corresponding discussions:
1. Code Description: Explain the algorithm underlying the ID3 and the J4.8 code in terms of the input they receive, the output they produce, and the main steps they follow to produce their output.
2. Data: Describe the dataset that you selected in terms of the attributes present in the data, the number of instances, missing values, and other relevant characteristics.
  Provide a detailed description of the preprocessing of your data. Justify the preprocessing you apply and explain why the resulting data is appropriate for mining decision trees.
3. Experiments: For each experiment you ran describe:
  - Data: What data did you use to construct and test your decision tree?
  - Any additional pre- or post-processing done to the data or the tree in order to improve the accuracy of your tree.
  - Accuracy of the resulting decision tree.
  - Discuss how this accuracy compares with that of your most accurate ZeroR experiment from the previous assignment.
4. Summary of Results
  - What was the accuracy of the most accurate decision tree constructed in your project?
  - Include the first 60 lines or so of the best tree you obtained in your report.
  - Discuss the strengths and the weaknesses of your project.
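
For the comparison with ZeroR requested in the Experiments section, recall that ZeroR ignores all attributes and always predicts the most frequent class in the training data, so its accuracy is the baseline any decision tree should beat. The sketch below computes that baseline in plain Python; the function name, the class labels, and the tree accuracy value are hypothetical and serve only as an illustration.

```python
from collections import Counter


def zero_r_accuracy(train_labels, test_labels):
    """Return (majority_class, accuracy) of a majority-class baseline."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for label in test_labels if label == majority)
    return majority, correct / len(test_labels)


# Hypothetical labels, invented for illustration only.
train_labels = ["spruce", "spruce", "pine", "spruce", "fir"]
test_labels = ["spruce", "pine", "spruce", "fir"]

majority, baseline = zero_r_accuracy(train_labels, test_labels)

# Hypothetical accuracy of a decision tree measured on the same test set.
tree_accuracy = 0.75
print(majority, baseline, tree_accuracy - baseline)
```

Reporting the difference between the tree's accuracy and the baseline, rather than the tree's accuracy alone, makes clear how much the tree actually learned from the attributes.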
Oral Report. We will discuss the results from the individual projects during the class on February 3. Your oral report should summarize the different sections of your written report as described above. Each of you will have 7 minutes to explain your results and to discuss your project in class. Be prepared!
