CS 539 Spring 2003

Computer Science Department

CS539 Machine Learning - Spring 2003
Project 2 - Decision Trees

PROF. CAROLINA RUIZ

Due Date: Monday, Jan. 27 2003 at 8 am.

Project Description
Project Assignment
Report Submission and Due Date

PROJECT DESCRIPTION

Construct the most accurate decision tree you can for predicting the class attribute (CARAVAN Number of mobile home policies) in the The Insurance Company Benchmark (COIL 2000) dataset.

PROJECT ASSIGNMENT

Read Chapter 3 of the textbook about decision trees in great detail.
The following are guidelines for the construction of your decision tree:
- Code: You can use the decision tree methods implemented in the Weka system. I recommend using ID3 for your experiments. Read the Weka code implementing ID3 in detail. Look also at the J48 classifier.
- Training and Testing Instances:
  Use the ticdata2000.txt data for training and the ticeval2000.txt data for testing. You may restrict your experiments to a subset of the instances IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your decision tree is, the better.
- Preprocessing of the Data:
  A main part of this project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining and/or using the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, etc. Your report should contained a detailed description of the preprocessing of your dataset and justifications of the steps you followed. If Weka does not provide the functionality you need to preprocess your data as you need to obtain useful patterns, preprocess the data yourself either by writing the necessary filters (you can incorporate them in Weka if you wish).
  To the extent possible, modify the attribute names and the value names so that the resulting decision trees are easier to read.
- Evaluation and Testing: Experiment with different testing methods:
  1. Supply separate training (ticdata2000.txt) and testing (ticeval2000.txt) data to Weka.
  2. Supply training (ticdata2000.txt or ticdata2000.txt + ticeval2000.txt) data to Weka and experiment with several split ratios.
  3. Supply training (ticdata2000.txt or ticdata2000.txt + ticeval2000.txt) data to Weka and
  4. Use n-fold crossvalidation to test your results Experiment with different values for the number of folds.
- Prunning of your decision tree:
  Determine (by reading Weka's ID3 code) whether or not Weka performs any pre- or post-prunning of the decision tree in order to increase the classification accuracy and/or to reduce the size of the decision tree. If so, experiment with this functionality. Modify the code if needed to allow for pre- and/or post-prunning of the tree. Also, experiment with Weka's J48 classifier.

REPORT AND DUE DATE

Written Report.
Your report should contain the following sections with the corresponding discussions:
1. Code Description: Describe the decision tree code that you used from Weka. Explain the algorithm underlying the code in terms of the input it receives and the output it produces, and the main steps it follows to produce this output.
2. Data: Describe the dataset that you selected in terms of the attributes present in the data, the number of instances, missing values, and other relevant characteristics.
  Provide a detail description of the preprocessing of your data. Justify the preprocessing you apply and why the resulting data is the appropriate one for mining decision trees from it.
3. Experiments: For each experiment you ran describe:
  - Data: What data did you use to construct and test your decision tree?
  - Any additional pre or post processing done to the data or the tree in order to improve the accuracy of your tree.
  - Accuracy of the resulting decision tree.
  - Discuss how this accuracy compares with that of your most accurate ZeroR experiment from the previous assignment.
4. Summary of Results
  - What was the accuracy of the most accurate decision tree constructed in your project?
  - Include the most accurate tree you obtained in your report.
  - strengths and the weaknesses of your project.
Oral Report. We will discuss the results from the individual projects during the class on January 27. Your oral report should summarize the different sections of your written report as described above. Each of you will have 5 minutes to explain your results and to discuss your project in class. Be prepared!
Submission and Due Date.
Please submit the following files by email to ruiz@cs.wpi.edu by 8:00 am on Monday, January 27 2003. Submissions received on Monday, Jan. 27 between 8:01 am and 10:00 am will be penalized with 30% off the grade and submissions after 10:00 am won't be accepted.
1. [your-lastname]_proj2_report.pdf containing your written report in PDF. For instance my file would be named ruiz_proj2_report.pdf.
2. [your-lastname]_proj2_slides.[ext] containing your slides for your oral report. This file should be either a PDF file (ext=pdf) or a PowerPoint file (ext=ppt).

CS539 Machine Learning - Spring 2003 Project 2 - Decision Trees

PROF. CAROLINA RUIZ

PROJECT DESCRIPTION

PROJECT ASSIGNMENT

REPORT AND DUE DATE

CS539 Machine Learning - Spring 2003
Project 2 - Decision Trees