WPI Worcester Polytechnic Institute

Computer Science Department

CS539 Machine Learning - Spring 2005 
Project 2 - Decision Trees


Due Date: Thursday, Feb. 3rd 2005. Slides are due at 3:00 (by email) and Written Report is due at 4:00 pm (beginning of class). 


Construct the best (i.e., most accurate, smallest, and/or most readable) decision tree you can for predicting the class attribute for each of the following datasets:

  1. The Automobile Database taken from the UCI Machine Learning Repository.

  2. The Covertype data taken from the UCI Machine Learning Repository.


  1. Read Chapter 3 of the textbook about decision trees in great detail.

  2. The following are guidelines for the construction of your decision tree:

    • Code: You can use the decision tree methods implemented in the Weka system. Use ID3 and J4.8 for your experiments. Read the Weka code implementing ID3 and J4.8 in detail.
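As background for reading the ID3 code: the core of ID3 is the information-gain criterion, which at each node splits on the attribute that most reduces the entropy of the class labels. The following is a minimal Python sketch on toy data, not Weka's actual implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction from splitting the instances on one attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder

# Toy data: splitting on the single attribute separates the classes
# perfectly, so the gain equals the full entropy of the labels (1 bit).
rows = [["sunny"], ["sunny"], ["rain"], ["rain"]]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))  # 1.0
```

ID3 computes this gain for every candidate attribute and splits on the one with the highest value, recursing until the labels at a node are pure.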

    • Training and Testing Instances:

      You may restrict your experiments to a subset of the instances IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your decision tree is, the better.

    • Preprocessing of the Data: A main part of this project is the PREPROCESSING of your dataset.

      • For both ID3 and J4.8: You should apply relevant filters to your dataset before doing the mining and/or use the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications for the steps you followed. If Weka does not provide the functionality you need to preprocess your data to obtain useful patterns, preprocess the data yourself by writing the necessary filters (you can incorporate them into Weka if you wish).

        To the extent possible, modify the attribute names and the value names so that the resulting decision trees are easier to read.

      • For J4.8: Read J4.8's code to determine how J4.8 handles numeric attributes, missing values, etc. if they are present in the dataset. Also compare the performance of J4.8 when you allow it to handle numeric attributes and missing values automatically vs. its performance when you pre-process the data to handle those cases.
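Two of the preprocessing steps mentioned above, replacing missing values and discretizing numeric attributes, can be sketched as follows. This mirrors the idea behind Weka's ReplaceMissingValues and Discretize filters, but the helper names and toy data are illustrative, not Weka code:

```python
def replace_missing(values, missing=None):
    """Replace missing numeric values with the mean of the present ones."""
    present = [v for v in values if v is not missing]
    mean = sum(present) / len(present)
    return [mean if v is missing else v for v in values]

def discretize_equal_width(values, bins=3):
    """Equal-width binning into nominal labels bin0..bin(bins-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    out = []
    for v in values:
        b = min(int((v - lo) / width), bins - 1)  # clamp the max value
        out.append(f"bin{b}")
    return out

# Toy attribute (e.g., a car's price) with one missing value.
prices = [5000, None, 15000, 30000]
filled = replace_missing(prices)      # None -> mean of the other three
print(discretize_equal_width(filled, 3))
```

Comparing J4.8's automatic handling against such explicit preprocessing (as the bullet above asks) tells you whether the built-in strategies suit your dataset.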

    • Evaluation and Testing: Experiment with different testing methods:

      1. Supply separate training and testing data to Weka.

      2. Supply training data to Weka and experiment with several split ratios.

      3. Use n-fold cross-validation to test your results. Experiment with different values for the number of folds.
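The two split-based testing methods above can be sketched as follows; these helpers are illustrative stand-ins for Weka's "Percentage split" and "Cross-validation" test options, not its code:

```python
import random

def percentage_split(data, train_ratio=0.66, seed=1):
    """Shuffle, then split into one training set and one test set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def cross_validation_folds(data, n_folds=10, seed=1):
    """Yield (train, test) pairs; each instance is tested exactly once."""
    rng = random.Random(seed)
    order = list(range(len(data)))
    rng.shuffle(order)
    for i in range(n_folds):
        test_idx = set(order[i::n_folds])   # every n-th shuffled index
        train = [data[j] for j in range(len(data)) if j not in test_idx]
        test = [data[j] for j in sorted(test_idx)]
        yield train, test

data = list(range(100))
train, test = percentage_split(data)
print(len(train), len(test))                # 66 training, 34 test
for train, test in cross_validation_folds(data, n_folds=10):
    assert len(train) == 90 and len(test) == 10
```

Cross-validation averages the accuracy over the n test folds, which is why it gives a more stable estimate than a single split, especially on small datasets.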

    • Pruning of your decision tree:

      Read Weka's ID3 and J4.8 code to determine what post-processing techniques they offer to increase the classification accuracy and/or to reduce the size of the decision tree. Describe that functionality in detail in your written report and experiment with it. Alter Weka's code if you want to tailor it to your needs.
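One pruning strategy you will encounter in the J4.8 code is reduced-error pruning (J4.8 offers it as an option alongside its default pessimistic pruning): a subtree is replaced by its majority-class leaf whenever that does not hurt accuracy on a held-out pruning set. A toy Python sketch over trees represented as nested dicts, not Weka's actual code:

```python
from collections import Counter

def classify(tree, instance):
    """Walk a tree of {attribute: {value: subtree}} dicts to a leaf label."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[instance[attr]]
    return tree

def errors(tree, data):
    return sum(classify(tree, x) != x["class"] for x in data)

def prune(tree, data):
    """Replace a subtree with its majority-class leaf whenever that
    does not increase the error count on the held-out pruning set."""
    if not isinstance(tree, dict):
        return tree
    attr, branches = next(iter(tree.items()))
    pruned = {attr: {v: prune(sub, [x for x in data if x[attr] == v])
                     for v, sub in branches.items()}}
    if not data:
        return pruned
    majority = Counter(x["class"] for x in data).most_common(1)[0][0]
    return majority if errors(majority, data) <= errors(pruned, data) else pruned

# The humidity test adds nothing on this pruning set, so it collapses to a leaf.
tree = {"outlook": {"sunny": {"humidity": {"high": "no", "normal": "no"}},
                    "rain": "yes"}}
pruning_set = [{"outlook": "sunny", "humidity": "high", "class": "no"},
               {"outlook": "sunny", "humidity": "normal", "class": "no"},
               {"outlook": "rain", "humidity": "high", "class": "yes"}]
print(prune(tree, pruning_set))
```

Pruning this way trades a little training-set fit for a smaller, more readable tree that usually generalizes better, which is exactly the accuracy-vs-size trade-off the report should discuss.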