WPI Worcester Polytechnic Institute

Computer Science Department

CS539 Machine Learning - Spring 2003 
Project 2 - Decision Trees


Due Date: Monday, Jan. 27 2003 at 8 am. 


Construct the most accurate decision tree you can for predicting the class attribute (CARAVAN: Number of mobile home policies) in
The Insurance Company Benchmark (COIL 2000) dataset.


  1. Read Chapter 3 of the textbook about decision trees in great detail.

  2. The following are guidelines for the construction of your decision tree:

    • Code: You can use the decision tree methods implemented in the Weka system. I recommend using ID3 for your experiments. Read the Weka code implementing ID3 in detail. Look also at the J48 classifier.

    • Training and Testing Instances:

      Use the ticdata2000.txt data for training and the ticeval2000.txt data for testing. You may restrict your experiments to a subset of the instances IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your decision tree is, the better.

    • Preprocessing of the Data:

      A main part of this project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining and/or using the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications of the steps you followed. If Weka does not provide the functionality you need to preprocess your data and obtain useful patterns, preprocess the data yourself by writing the necessary filters (you can incorporate them in Weka if you wish).

      To the extent possible, modify the attribute names and the value names so that the resulting decision trees are easier to read.
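As a minimal sketch of the renaming step above, the following Python snippet turns tab-delimited rows into ARFF text with readable attribute names, declaring every column as nominal (Weka's ID3 implementation expects nominal attributes). The three column names, the sample rows, and the helper names `parse_tab_delimited` and `to_arff` are hypothetical and chosen only for illustration. The sketch also assumes that the data file is tab-delimited and holds integer codes; the real names and value labels should come from the dataset's data dictionary.

```python
def parse_tab_delimited(text):
    """Split tab-delimited text into rows of string fields, skipping blank lines."""
    return [line.split("\t") for line in text.splitlines() if line.strip()]


def to_arff(relation, names, rows):
    """Build ARFF text that declares every column as a nominal attribute."""
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(names):
        # Assumption: every value is an integer code, so sort numerically.
        values = sorted({row[col] for row in rows}, key=int)
        lines.append("@attribute %s {%s}" % (name, ",".join(values)))
    lines += ["", "@data"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical three-column excerpt; the real file has many more columns.
sample = "33\t1\t0\n37\t1\t1\n9\t2\t0\n"
names = ["customer_subtype", "num_houses", "caravan"]
arff = to_arff("coil2000", names, parse_tab_delimited(sample))
print(arff)
```

If separate training and testing files are converted this way, the nominal value lists should be built from the union of both files so that the two ARFF headers match.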

    • Evaluation and Testing: Experiment with different testing methods:

      1. Supply separate training (ticdata2000.txt) and testing (ticeval2000.txt) data to Weka.

      2. Supply training (ticdata2000.txt or ticdata2000.txt + ticeval2000.txt) data to Weka and experiment with several split ratios.

      3. Supply training (ticdata2000.txt or ticdata2000.txt + ticeval2000.txt) data to Weka and use n-fold cross-validation to test your results. Experiment with different values for the number of folds.
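To make the difference between the testing methods concrete, here is a small sketch of how a percentage split and an n-fold cross-validation partition a set of instance indices. It is an illustration in plain Python, not Weka's implementation; the function names `percentage_split` and `k_fold_indices` and the fixed random seed are assumptions made for this example, and the folds are not stratified by class.

```python
import random


def percentage_split(n_instances, train_ratio, seed=0):
    """Shuffle the indices and cut them once into a training and a testing part."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_instances * train_ratio))
    return indices[:cut], indices[cut:]


def k_fold_indices(n_instances, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs; each instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    train, test = percentage_split(10, 0.7)
    print(len(train), len(test))
    for train_idx, test_idx in k_fold_indices(10, 3):
        print(sorted(test_idx))
```

With a single split, the accuracy estimate depends on which instances happen to land in the test part; with n folds, every instance is used for testing once, and the reported accuracy is aggregated over the n runs.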

    • Pruning of your decision tree:

      Determine (by reading Weka's ID3 code) whether or not Weka performs any pre- or post-pruning of the decision tree in order to increase the classification accuracy and/or to reduce the size of the decision tree. If so, experiment with this functionality. Modify the code if needed to allow for pre- and/or post-pruning of the tree. Also, experiment with Weka's J48 classifier.
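One simple form of pre-pruning can be sketched as follows: ID3, as described in Chapter 3, picks the attribute with the highest information gain, and a pre-pruning rule may refuse to split a node when the best gain falls below a threshold. The code below is plain Python rather than Weka code; the function names and the `min_gain` threshold of 0.01 are hypothetical choices for illustration, not settings taken from Weka.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on the attribute at index attr."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


def best_split(rows, labels, min_gain=0.01):
    """Return the index of the best attribute, or None if pre-pruning stops the split."""
    gains = [information_gain(rows, labels, a) for a in range(len(rows[0]))]
    best = max(range(len(gains)), key=gains.__getitem__)
    return best if gains[best] >= min_gain else None


if __name__ == "__main__":
    rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
    print(best_split(rows, [1, 1, 0, 0]))  # attribute 0 separates the classes
    print(best_split(rows, [1, 0, 0, 1]))  # no single attribute helps here
```

Post-pruning works in the opposite direction: the full tree is grown first, and subtrees are then replaced by leaves when doing so does not hurt the estimated accuracy.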